Abstract

We consider the problem of signal estimation (denoising) from a statistical-mechanical
perspective, using a relationship between the minimum mean square error (MMSE)
of estimating a signal and the mutual information between this signal and its noisy
version. The paper consists of essentially two parts. In the first, we derive several
statistical-mechanical relationships between a few important quantities in this problem
area, such as the MMSE, the differential entropy, the Fisher information, the free
energy, and a generalized notion of temperature. We also draw analogies and differences
between certain relations pertaining to the estimation problem and the parallel
relations in thermodynamics and statistical physics. In the second part of the paper, we
provide several application examples, where we demonstrate how certain analysis tools
that are customary in statistical physics prove useful in the analysis of the MMSE. In
most of these examples, the corresponding statistical-mechanical systems turn out to
consist of strong interactions that cause phase transitions, which in turn are reflected
as irregularities and discontinuities (similar to threshold effects) in the behavior of the
MMSE.

Index Terms: Gaussian channel, denoising, de Bruijn's identity, MMSE estimation,
phase transitions, random energy model, spin glasses, statistical mechanics.

1 Introduction

The relationships and the interplay between Information Theory and Statistical Physics
have been recognized and exploited for several decades by now. The roots of these
relationships date back to the celebrated papers by Jaynes from the late fifties of the previous
century [15, 16], but their aspects and scope have been vastly expanded and deepened
ever since. Much of the research activity in this interdisciplinary problem area revolves
around the identification of 'mappings' between problems in Information Theory and
certain many-particle systems in Statistical Physics, which are analogous at least as far as
their mathematical formalisms go. One important example is the parallelism and analogy
between random code ensembles in Information Theory and certain models of disordered
magnetic materials, known as spin glasses. This analogy was first identified by Sourlas (see,
e.g., [27,28]) and has been further studied in the last two decades to a great extent. Beyond
the fact that these parallelisms and analogies are academically interesting in their own right,
they also prove useful and beneficial. Their utility stems from the fact that physical insights,
as well as statistical-mechanical tools and analysis techniques, can be harnessed in order to
advance the knowledge and the understanding with regard to the information-theoretic
problem under discussion.

In this context, our work takes place at the meeting point of Information Theory, 
Statistical Physics, and yet another area - Estimation Theory, where the bridge between 
information-theoretic and the estimation-theoretic ingredients of the topic under discus- 
sion is established by an identity [12, Theorem 2], equivalent to the de Bruijn identity 
(cf. e.g., [3, Theorem 17.7.2]), which relates the minimum mean square error (MMSE), of 
estimating a signal in additive white Gaussian noise (AWGN), to the mutual information 
between this signal and its noisy version. We henceforth refer to this relation as the I- 
MMSE relation. It should be pointed out that the present work is not the first to deal 
with the interplay between the I-MMSE relation and statistical mechanics. In an earlier 
paper by Shental and Kanter [26], the main theme was an attempt to provide an alternative
proof of the I-MMSE relation, which is rooted in thermodynamics and statistical physics.
However, to this end, the authors of [26] had to generalize the theory of thermodynamics.

Our study was greatly motivated by [26] (in its earlier versions), but it takes a substantially
different route. Rather than proving the I-MMSE relation, we simply use it in conjunction
with analysis techniques used in statistical physics. The basic idea underlying
our work is that when the channel input signal is rather complicated (but yet, not too 
complicated), which is the case in certain applications, the mutual information with its 
noisy version can be evaluated using statistical-mechanical analysis techniques, and then 
related to the MMSE using the I-MMSE relation. This combination proves rather powerful, 
because it enables one to distinguish between situations where irregular (i.e., non-smooth 
or even discontinuous) behavior of the mean square error (as a function of the signal-to- 
noise ratio) is due to artifacts of a certain ad-hoc signal estimator, and situations where 
these irregularities are inherent in the model, in the sense that they are apparent even in 
optimum estimation. In the latter situations, these irregularities (or threshold effects) are 
intimately related to phase transitions in the parallel statistical-mechanical systems. 

These motivations set the stage for our study of the relationships between the MMSE 
and statistical mechanics, first of all, at the general level, and then in certain concrete
applications. Accordingly, the paper consists of two main parts. In the first, which is a 
general theoretical study, we derive several statistical-mechanical relationships between a 
few important quantities such as the MMSE, the differential entropy, the Fisher information, 
the free energy, and a generalized notion of temperature. We also draw analogies and 
differences between certain relations pertaining to the estimation problem and the parallel 
relations in thermodynamics and statistical physics. In the second part of the paper, we 
provide several application examples, where we demonstrate how certain analysis tools that 
are customary in statistical physics (in conjunction with large deviations theory) prove 
useful in the analysis of the MMSE. In light of the motivations described in the previous 
paragraph, in most of these examples, the corresponding statistical-mechanical systems 
turn out to consist of strong interactions that cause phase transitions, which in turn are 
reflected as irregularities and discontinuities in the behavior of the MMSE. 

The remaining part of this paper is organized as follows: In Section 2, we establish a
few notation conventions and we formalize the setting under discussion. In Section 3, we
provide the basic background in statistical physics that will be used in the sequel. Section
4 is devoted to the general theoretical study, and finally, Section 5 includes application
examples, where the MMSE will be analyzed using statistical-mechanical tools.

2 Notation Conventions, Formalization and Preliminaries 

2.1 Notation Conventions 

Throughout this paper, scalar random variables (RV's) will be denoted by capital letters, 
like X and Y , their sample values will be denoted by the respective lower case letters, and 
their alphabets will be denoted by the respective calligraphic letters. A similar convention 
will apply to random vectors and their sample values, which will be denoted with the 
same symbols in the boldface font. Thus, for example, X will denote a random n-vector 
$(X_1, \ldots, X_n)$, and $x = (x_1, \ldots, x_n)$ is a specific vector value in $\mathcal{X}^n$, the n-th Cartesian power
of $\mathcal{X}$.

Sources and channels will be denoted generically by the letters P and Q. The expectation
operator will be denoted by $E\{\cdot\}$. When the underlying probability measure is indexed by
a parameter, say, $\beta$, then it will be used as a subscript of P, p and E, unless there is no
ambiguity.

For two positive sequences $\{a_n\}$ and $\{b_n\}$, the notation $a_n \doteq b_n$ means that $a_n$ and $b_n$
are asymptotically of the same exponential order, that is, $\lim_{n\to\infty} \frac{1}{n}\ln\frac{a_n}{b_n} = 0$. Similarly,
$a_n \stackrel{\cdot}{\le} b_n$ means that $\limsup_{n\to\infty} \frac{1}{n}\ln\frac{a_n}{b_n} \le 0$, etc. Information theoretic quantities like
entropies and mutual informations will be denoted following the usual conventions of the
Information Theory literature.

2.2 Formalization and Preliminaries 

We consider the simplest variant of the signal estimation problem setting studied in [12], 
with a few slight modifications in notation. Let $(X, Y)$ be a pair of random vectors in $\mathbb{R}^n$,
related by the Gaussian channel

$$Y = X + N, \tag{1}$$

where $N$ is a random vector (noise), whose components are i.i.d., zero-mean, Gaussian
random variables (RV's) whose variance is $1/\beta$, where $\beta$ is a given positive constant designating
the signal-to-noise ratio (SNR), or the inverse temperature from the statistical-mechanical point
of view (cf. Section 3). It is assumed that $X$ and $N$ are independent. Upon receiving $Y$,
one is interested in inferring about the (desired) random vector $X$. As is well known, the
best estimator of $X$ given the observation vector $Y$, in the mean square error (MSE) sense,
i.e., the MMSE estimator, is the conditional mean $\hat{X} = E(X|Y)$, and the corresponding
MMSE, $E\|X - \hat{X}\|^2$, will be denoted by $\text{mmse}(X|Y)$. Theorem 2 in [12], which provides the
I-MMSE relation, relates the MMSE to the mutual information $I(X;Y)$ (defined using the
natural base logarithm) according to

$$\frac{dI(X;Y)}{d\beta} = \frac{\text{mmse}(X|Y)}{2}. \tag{2}$$

For example, if $n = 1$ and $X \sim \mathcal{N}(0, 1)$, then $I(X;Y) = \frac{1}{2}\ln(1 + \beta)$, which leads to
$\text{mmse}(X|Y) = 1/(1 + \beta)$, in agreement with elementary results. The relationship has
been used in [24] to compute the mutual information achieved by low-density parity-check
(LDPC) codes over Gaussian channels through evaluation of the marginal estimation error.
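
As a quick numerical sanity check of the scalar Gaussian example above, the following Python sketch compares a finite-difference derivative of $I(X;Y) = \frac{1}{2}\ln(1+\beta)$ with half of the MMSE. The function names and the step size `h` are illustrative choices made here, not notation taken from [12].

```python
import math

def mutual_information(beta: float) -> float:
    """I(X;Y) in nats for scalar X ~ N(0,1) and noise variance 1/beta."""
    return 0.5 * math.log(1.0 + beta)

def mmse(beta: float) -> float:
    """MMSE of estimating X ~ N(0,1) from Y = X + N, Var(N) = 1/beta."""
    return 1.0 / (1.0 + beta)

def di_dbeta(beta: float, h: float = 1e-6) -> float:
    """Central finite-difference estimate of dI/dbeta (h is an assumed step)."""
    return (mutual_information(beta + h) - mutual_information(beta - h)) / (2.0 * h)

if __name__ == "__main__":
    for beta in (0.1, 1.0, 10.0):
        # The two printed columns should agree up to finite-difference error.
        print(beta, di_dbeta(beta), mmse(beta) / 2.0)
```

The same comparison can be repeated for other input distributions whenever both sides of the relation can be evaluated numerically.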

A very important function, which will be pivotal to our derivation of both $E(X|Y)$ and
$\text{mmse}(X|Y)$, as well as to the mutual information $I(X;Y)$, is the posterior distribution.
Denoting the probability mass function of $x$ by $Q(x)$ and the channel induced by (1) by
$P(y|x)$, then

$$P(x|y) = \frac{Q(x)P(y|x)}{\sum_{x'} Q(x')P(y|x')} = \frac{Q(x)\exp[-\beta\|y - x\|^2/2]}{\sum_{x'} Q(x')\exp[-\beta\|y - x'\|^2/2]} = \frac{Q(x)\exp[-\beta\|y - x\|^2/2]}{Z(\beta|y)}, \tag{3}$$

where we defined

$$Z(\beta|y) = \sum_x Q(x)\exp[-\beta\|y - x\|^2/2] = (2\pi/\beta)^{n/2} P_\beta(y), \tag{4}$$

where $P_\beta(y)$ is the channel output density. Here we have assumed that $x$ is discrete, as
otherwise Q should be replaced by the probability density function (pdf) and the summation
over $\{x'\}$ should be replaced by an integral. The function $Z(\beta|y)$ is very similar to the
so-called partition function, which is well known to play a very central role in statistical
mechanics, and will also play a central role in our analysis. In the next section, we
give some necessary background in statistical mechanics that will be essential to our study.
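
To make the roles of the posterior and of $Z(\beta|y)$ concrete, here is a minimal Python sketch for a small discrete input alphabet. The codebook, the prior and all function names are hypothetical illustrations introduced here; only the formulas (3) and (4) come from the text.

```python
import math

def sq_dist(y, x):
    """Squared Euclidean distance ||y - x||^2."""
    return sum((yi - xi) ** 2 for yi, xi in zip(y, x))

def partition_function(beta, y, codebook, prior):
    """Z(beta|y) = sum_x Q(x) exp(-beta ||y - x||^2 / 2), as in eq. (4)."""
    return sum(q * math.exp(-0.5 * beta * sq_dist(y, x))
               for x, q in zip(codebook, prior))

def posterior(beta, y, codebook, prior):
    """P(x|y) of eq. (3), one probability per codebook entry."""
    z = partition_function(beta, y, codebook, prior)
    return [q * math.exp(-0.5 * beta * sq_dist(y, x)) / z
            for x, q in zip(codebook, prior)]

def conditional_mean(beta, y, codebook, prior):
    """MMSE estimator E(X|Y=y), computed as the posterior average."""
    post = posterior(beta, y, codebook, prior)
    return [sum(p * x[i] for p, x in zip(post, codebook))
            for i in range(len(y))]

if __name__ == "__main__":
    # Assumed toy input: X uniform on {-1, +1}, n = 1.
    book, q = [(1.0,), (-1.0,)], [0.5, 0.5]
    print(posterior(2.0, (0.3,), book, q), conditional_mean(2.0, (0.3,), book, q))
```

For the binary toy input used in the usage lines, the conditional mean reduces to $\tanh(\beta y)$, which gives a convenient check.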



3 Physics Background 

Consider a physical system with n particles, which can be in a variety of microscopic 
states ('microstates'), defined by combinations of physical quantities associated with these 
particles, e.g., positions, momenta, angular momenta, spins, etc., of all n particles. For each 
such microstate of the system, which we shall designate by a vector $x = (x_1, \ldots, x_n)$, there
is an associated energy, given by a Hamiltonian (energy function), $\mathcal{E}(x)$. For example, if
$x_i = (p_i, r_i)$, where $p_i$ is the momentum vector of particle number i and $r_i$ is its position
vector, then classically, $\mathcal{E}(x) = \sum_{i=1}^n \left[\frac{\|p_i\|^2}{2m} + m g z_i\right]$, where m is the mass of each particle,
$z_i$ is its height - one of the coordinates of $r_i$, and g is the gravitational acceleration.

One of the most fundamental results in statistical physics (based on the law of en- 
ergy conservation and the basic postulate that all microstates of the same energy level 
are equiprobable) is that when the system is in thermal equilibrium with its environment, 
the probability of finding the system in a microstate x is given by the Boltzmann-Gibbs
distribution

$$P(x) = \frac{e^{-\beta\mathcal{E}(x)}}{Z(\beta)}, \tag{5}$$

where $\beta = 1/(kT)$, k being Boltzmann's constant and T being temperature, and $Z(\beta)$ is the
normalization constant, called the partition function, which is given by

$$Z(\beta) = \sum_x e^{-\beta\mathcal{E}(x)},$$

assuming discrete states. In case of continuous state space, the partition function is defined
as

$$Z(\beta) = \int dx\, e^{-\beta\mathcal{E}(x)},$$
and P(x) is understood as a pdf. The role of the partition function is by far deeper than just
being a normalization factor, as it is actually the key quantity from which many macroscopic
physical quantities can be derived: for example, the free energy¹ is $F(\beta) = -\frac{1}{\beta}\ln Z(\beta)$, the
average internal energy is given by $\bar{E} = E\{\mathcal{E}(X)\} = -(d/d\beta)\ln Z(\beta)$ with $X \sim P(x)$, the
heat capacity is obtained from the second derivative, etc. One of the ways to obtain eq.
(5) is as the maximum entropy distribution under an average energy constraint (owing to
the second law of thermodynamics), where $\beta$ plays the role of a Lagrange multiplier that
controls the average energy.
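
The statement that macroscopic quantities follow from the partition function can be illustrated with a small Python sketch for a finite set of energy levels. The energy values and helper names below are assumptions made purely for illustration; the sketch checks numerically that the Boltzmann average of the energy agrees with $-(d/d\beta)\ln Z(\beta)$.

```python
import math

def log_partition(beta, energies):
    """ln Z(beta) for a finite list of energy levels (log-sum-exp form)."""
    m = min(energies)
    return -beta * m + math.log(sum(math.exp(-beta * (e - m)) for e in energies))

def boltzmann(beta, energies):
    """Boltzmann-Gibbs probabilities P(x) = exp(-beta E(x)) / Z(beta)."""
    lz = log_partition(beta, energies)
    return [math.exp(-beta * e - lz) for e in energies]

def average_energy(beta, energies):
    """Average internal energy under the Boltzmann-Gibbs distribution."""
    return sum(p * e for p, e in zip(boltzmann(beta, energies), energies))

def free_energy(beta, energies):
    """F(beta) = -(1/beta) ln Z(beta)."""
    return -log_partition(beta, energies) / beta

def energy_from_log_z(beta, energies, h=1e-6):
    """-(d/dbeta) ln Z(beta) via a central difference (h is an assumed step)."""
    return -(log_partition(beta + h, energies)
             - log_partition(beta - h, energies)) / (2.0 * h)

if __name__ == "__main__":
    levels = [0.0, 1.0, 1.0, 2.5]  # hypothetical energy levels
    print(average_energy(1.0, levels), energy_from_log_z(1.0, levels))
```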

An important special case, which is very relevant both in physics and in the study 
of AWGN channel considered here, is the case where the Hamiltonian £{x) is additive 
and quadratic (or "harmonic" in the physics terminology), i.e., $\mathcal{E}(x) = \sum_{i=1}^n \frac{1}{2}\kappa x_i^2$, for
some constant $\kappa > 0$, or even more generally, $\mathcal{E}(x) = \sum_{i=1}^n \frac{1}{2}\kappa_i x_i^2$, which means that the
components $\{x_i\}$ are Gaussian and independent. A classical result in this case, known as
the equipartition theorem of energy, which is very easy to show, asserts that each particle
(or, more precisely, each degree of freedom) contributes an average energy of $E\{\frac{1}{2}\kappa_i X_i^2\} = 1/(2\beta) = kT/2$
independently of $\kappa$ (or $\kappa_i$).
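
The equipartition statement can be checked by simulation. The sketch below, with an assumed sample size and seed, draws a single degree of freedom from the Boltzmann density proportional to $\exp(-\beta\kappa x^2/2)$, i.e. a zero-mean Gaussian with variance $1/(\beta\kappa)$, and averages its energy; the result should be close to $1/(2\beta)$ whatever the value of $\kappa$.

```python
import random

def mean_harmonic_energy(beta, kappa, samples=200_000, seed=0):
    """Monte Carlo estimate of E{kappa X^2 / 2} with X ~ N(0, 1/(beta*kappa))."""
    rng = random.Random(seed)
    sigma = (beta * kappa) ** -0.5  # standard deviation of the Boltzmann density
    total = 0.0
    for _ in range(samples):
        x = rng.gauss(0.0, sigma)
        total += 0.5 * kappa * x * x
    return total / samples

if __name__ == "__main__":
    for kappa in (0.5, 1.0, 4.0):
        # Each estimate should be near 1/(2*beta) = 0.25 for beta = 2.
        print(kappa, mean_harmonic_energy(2.0, kappa))
```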

Returning to the case of a general Hamiltonian, it is instructive to relate the Shannon 
entropy, pertaining to the Boltzmann-Gibbs distribution, to the quantities we have seen 
thus far. Specifically, the Shannon entropy $S(\beta) = -E\{\ln P(X)\}$ associated with
$P(x) = e^{-\beta\mathcal{E}(x)}/Z(\beta)$ is given by

$$S(\beta) = E\left\{\ln\frac{Z(\beta)}{e^{-\beta\mathcal{E}(X)}}\right\} = \ln Z(\beta) + \beta\cdot\bar{E},$$

where, as mentioned above, $\bar{E} = -\frac{d\ln Z(\beta)}{d\beta}$
is the average internal energy. This suggests the differential equation

$$\beta\dot{\psi}(\beta) - \psi(\beta) = S(\beta), \tag{7}$$

where $\psi(\beta) = -\ln Z(\beta)$ and $\dot{\psi}$ means the derivative of $\psi$. Equivalently, eq. (7) can be
rewritten as:

$$\frac{d}{d\beta}\left[\frac{\psi(\beta)}{\beta}\right] = \frac{S(\beta)}{\beta^2}, \tag{8}$$

whose solution is easily found to be

$$\psi(\beta) = \beta E_0 - \beta\int_\beta^\infty \frac{S(\hat{\beta})}{\hat{\beta}^2}\, d\hat{\beta}, \tag{9}$$

where $E_0 = \min_x \mathcal{E}(x)$ is the ground-state energy, here obtained as a constant of integration
by examining the limit of $\beta \to \infty$. Thus, we see that the log-partition function at a given
temperature can be expressed as a heat integral of the entropy, namely, as an integral of
a function that consists of the entropy at all lower temperatures. This is different from
the other relations we mentioned thus far, which were all 'pointwise' in the temperature
domain, in the sense that all quantities were pertaining to the same temperature. Taking
the derivative of $\psi(\beta)$ according to eq. (9), we obtain the average internal energy:

$$\bar{E} = \dot{\psi}(\beta) = E_0 - \int_\beta^\infty \frac{S(\hat{\beta})}{\hat{\beta}^2}\, d\hat{\beta} + \frac{S(\beta)}{\beta}, \tag{10}$$

where the first two terms form the free energy.²

¹ The free energy means the maximum work that the system can carry out in any process of fixed
temperature. The maximum is obtained when the process is reversible (slow, quasi-static changes in the
system).

As a final remark, we should note that although the expression $Z(\beta|y)$ of eq. (4) is similar
to that of $Z(\beta)$ defined in this section (for a quadratic Hamiltonian), there is nevertheless
a small difference: The exponentials in (4) are weighted by probabilities $\{Q(x)\}$, which are
independent of $\beta$. However, as explained in [17, p. 3713], this is not an essential difference
because these weights can be interpreted as degeneracy of states, that is, as multiple states
(whose number is proportional to $Q(x)$) of the same energy.
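
The heat-integral representation of the log-partition function discussed in this section, $\psi(\beta) = \beta E_0 - \beta\int_\beta^\infty S(\hat{\beta})\hat{\beta}^{-2}d\hat{\beta}$ with $\psi = -\ln Z$, can be verified numerically for a finite system. The sketch below is only an illustration under assumptions of our own: a hypothetical list of energy levels, and a midpoint rule applied after the change of variable $T = 1/\hat{\beta}$, which turns the integral into $\int_0^{1/\beta} S(1/T)\,dT$.

```python
import math

def _shifted_weights(beta, energies):
    """Return (E0, weights) with weights exp(-beta (E - E0)), E0 = min energy."""
    e0 = min(energies)
    return e0, [math.exp(-beta * (e - e0)) for e in energies]

def psi(beta, energies):
    """psi(beta) = -ln Z(beta)."""
    e0, w = _shifted_weights(beta, energies)
    return beta * e0 - math.log(sum(w))

def entropy(beta, energies):
    """S(beta) = ln Z + beta * (average energy), computed in shifted form."""
    e0, w = _shifted_weights(beta, energies)
    s = sum(w)
    excess = sum(wi * (e - e0) for wi, e in zip(w, energies)) / s
    return math.log(s) + beta * excess

def psi_from_heat_integral(beta, energies, steps=4000):
    """beta*E0 - beta * integral_0^{1/beta} S(1/T) dT via the midpoint rule."""
    dt = (1.0 / beta) / steps
    integral = sum(entropy(1.0 / ((k + 0.5) * dt), energies)
                   for k in range(steps)) * dt
    return beta * min(energies) - beta * integral

if __name__ == "__main__":
    levels = [0.0, 1.0, 1.0, 2.5]  # hypothetical energy levels
    print(psi(1.0, levels), psi_from_heat_integral(1.0, levels))
```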

4 Theoretical Derivations 

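Consider the Gaussian channel (1) and the corresponding posterior (3). Denoting by $E_\beta$
the expectation operator w.r.t. the joint pdf of $(X, Y)$ induced by $\beta$, we have:

$$\begin{aligned}
I(X;Y) &= E_\beta\left\{\ln\frac{\exp[-\beta\|Y - X\|^2/2]}{Z(\beta|Y)}\right\} \\
&= -\frac{\beta}{2}E_\beta\{\|Y - X\|^2\} - E_\beta\{\ln Z(\beta|Y)\} \\
&= -\frac{n}{2} - E_\beta\{\ln Z(\beta|Y)\},
\end{aligned} \tag{11}$$

where we use the fact that $E_\beta\{\|Y - X\|^2\} = E_\beta\{\|N\|^2\} = n/\beta$. Taking derivatives
w.r.t. $\beta$, and using the I-MMSE relation, we then have:

$$\frac{\text{mmse}(X|Y)}{2} = \frac{\partial I(X;Y)}{\partial\beta} = -\frac{\partial}{\partial\beta}E_\beta\{\ln Z(\beta|Y)\}, \tag{12}$$

and so, we obtain a very simple relation between the MMSE and the partition function of
the posterior:

$$\text{mmse}(X|Y) = -2\frac{\partial}{\partial\beta}E_\beta\{\ln Z(\beta|Y)\}. \tag{13}$$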
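
The relation between the MMSE and the derivative of $E_\beta\{\ln Z(\beta|Y)\}$ can be tested numerically on a simple non-Gaussian input. The Python sketch below assumes, purely for illustration, a scalar input uniform on $\{-1,+1\}$ (so that $E(X|Y=y) = \tanh(\beta y)$), evaluates Gaussian expectations by a trapezoidal rule on a truncated grid, and differentiates by central differences; the grid size, truncation and step are assumptions of this sketch.

```python
import math

def gauss_expect(f, half_width=10.0, steps=4000):
    """E{f(W)} for W ~ N(0,1), trapezoidal rule on [-half_width, half_width]."""
    dw = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        w = -half_width + k * dw
        edge = 0.5 if k in (0, steps) else 1.0
        total += edge * f(w) * math.exp(-0.5 * w * w)
    return total * dw / math.sqrt(2.0 * math.pi)

def softplus(t):
    """Numerically stable ln(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def log_z(beta, y):
    """ln Z(beta|y) for X uniform on {-1, +1}, n = 1."""
    return math.log(0.5) - 0.5 * beta * (y - 1.0) ** 2 + softplus(-2.0 * beta * y)

def mean_log_z(beta):
    """E_beta{ln Z(beta|Y)}; by symmetry we may condition on X = +1."""
    return gauss_expect(lambda w: log_z(beta, 1.0 + w / math.sqrt(beta)))

def mmse_direct(beta):
    """mmse = 1 - E{tanh^2(beta Y)} for the binary input."""
    return 1.0 - gauss_expect(
        lambda w: math.tanh(beta * (1.0 + w / math.sqrt(beta))) ** 2)

def mmse_from_partition(beta, h=1e-5):
    """-2 d/dbeta E{ln Z(beta|Y)} by a central difference."""
    return -2.0 * (mean_log_z(beta + h) - mean_log_z(beta - h)) / (2.0 * h)

if __name__ == "__main__":
    for beta in (0.5, 1.0, 4.0):
        print(beta, mmse_direct(beta), mmse_from_partition(beta))
```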

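By calculating the derivative of the right-hand side (r.h.s.) more explicitly, one further
obtains the following:

$$\begin{aligned}
-\frac{\partial}{\partial\beta}E_\beta\{\ln Z(\beta|Y)\} &= -\frac{\partial}{\partial\beta}\int_{\mathbb{R}^n} dy\, P_\beta(y)\ln Z(\beta|y) \\
&= -\int_{\mathbb{R}^n} dy\, P_\beta(y)\frac{\partial\ln Z(\beta|y)}{\partial\beta} - \int_{\mathbb{R}^n} dy\, \frac{\partial P_\beta(y)}{\partial\beta}\ln Z(\beta|y).
\end{aligned} \tag{14}$$

² By changing the integration variable from $\beta$ to T, this is identified with the relation $F = E_0 - \int_0^T S\,dT'$,
which together with $F = E - ST$, complies with the relation $E = E_0 + \int_0^T T'\,dS' = E_0 + \int_0^T dQ'$, accounting
for the simple fact that in the absence of any external work applied to the system, the internal energy is
simply the heat accumulated as temperature is raised from 0 to T.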
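
Now, the first term at the right-most side of (14) can easily be computed by using the fact
that $\ln Z(\beta|y)$ is a log-moment generating function of the energy (as is customarily done
in statistical mechanics, cf. the expression for the average internal energy in Section 3), which
implies that it is given by $\frac{1}{2}E_\beta\{\|Y - X\|^2\} = n/(2\beta) = nkT/2$, just like in the energy
equipartition theorem for quadratic Hamiltonians. As for the second term, we have

$$\begin{aligned}
\int_{\mathbb{R}^n} dy\, \frac{\partial P_\beta(y)}{\partial\beta}\ln Z(\beta|y)
&= \int_{\mathbb{R}^n} dy\, \ln Z(\beta|y)\,\frac{\partial}{\partial\beta}\left[\left(\frac{\beta}{2\pi}\right)^{n/2}\sum_x Q(x)\exp\{-\beta\|y - x\|^2/2\}\right] \\
&= \int_{\mathbb{R}^n} dy\, \ln Z(\beta|y)\left(\frac{\beta}{2\pi}\right)^{n/2}\sum_x Q(x)\exp\{-\beta\|y - x\|^2/2\}\left[\frac{n}{2\beta} - \frac{\|y - x\|^2}{2}\right] \\
&= -\frac{1}{2}\text{Cov}\{\|Y - X\|^2, \ln Z(\beta|Y)\}.
\end{aligned} \tag{15}$$

The MMSE is then given by

$$\text{mmse}(X|Y) = -2\frac{\partial}{\partial\beta}E_\beta\{\ln Z(\beta|Y)\} = \frac{n}{\beta} + \text{Cov}\{\|Y - X\|^2, \ln Z(\beta|Y)\}, \tag{16}$$

which can then be viewed as a variant of the energy equipartition theorem with a correction
term that stems from the fact that the pdf of Y depends on $\beta$.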

Another look, from an estimation-theoretic point of view, at this expression reveals the
following: The first term, $n/\beta = E\|Y - X\|^2$, is the amount of noise in the raw data Y,
without any processing. The second term, which is always negative, designates then the
noise suppression level due to MMSE estimation relative to the raw data. The intuition
behind the covariance term is that when the 'correct' x (the one that actually feeds the
Gaussian channel) dominates the partition function, then $\ln Z(\beta|Y) \approx -\beta\|Y - X\|^2/2$,
and so, there is a very strong negative correlation between $\|Y - X\|^2$ and $\ln Z(\beta|Y)$. In
particular,

$$\text{Cov}\{\|Y - X\|^2, -\beta\|Y - X\|^2/2\} = -\frac{\beta}{2}\text{Var}\{\|Y - X\|^2\} = -\frac{n}{\beta}, \tag{17}$$

which exactly cancels the above-mentioned first term, $n/\beta$, and so, the overall MMSE
essentially vanishes. When the correct x is not dominant, this correlation is weaker. Also,
note that since

$$E\|Y - X\|^2 = \text{mmse}(X|Y) + E\|Y - E(X|Y)\|^2, \tag{18}$$

then this implies that

$$E\|Y - E(X|Y)\|^2 = -\text{Cov}\{\|Y - X\|^2, \ln Z(\beta|Y)\}. \tag{19}$$
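
The decomposition of the raw noise level into the MMSE and the covariance term can be checked numerically. The following Python sketch again assumes a scalar binary input uniform on $\{-1,+1\}$ (an illustrative choice made here, not one of the paper's examples) and evaluates $E\|Y - E(X|Y)\|^2$, the covariance between $\|Y - X\|^2$ and $\ln Z(\beta|Y)$, and the MMSE by quadrature; all helper names and grid parameters are assumptions.

```python
import math

def gauss_expect(f, half_width=10.0, steps=4000):
    """E{f(W)} for W ~ N(0,1), trapezoidal rule on a truncated grid."""
    dw = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        w = -half_width + k * dw
        edge = 0.5 if k in (0, steps) else 1.0
        total += edge * f(w) * math.exp(-0.5 * w * w)
    return total * dw / math.sqrt(2.0 * math.pi)

def log_z(beta, y):
    """ln Z(beta|y) for X uniform on {-1, +1}, n = 1 (stable form)."""
    t = -2.0 * beta * y
    return (math.log(0.5) - 0.5 * beta * (y - 1.0) ** 2
            + max(t, 0.0) + math.log1p(math.exp(-abs(t))))

def _y(beta, w):
    """Channel output given X = +1 and standardized noise w."""
    return 1.0 + w / math.sqrt(beta)

def mmse(beta):
    """mmse = 1 - E{tanh^2(beta Y)}."""
    return 1.0 - gauss_expect(lambda w: math.tanh(beta * _y(beta, w)) ** 2)

def residual_power(beta):
    """E{(Y - E(X|Y))^2}, with E(X|Y) = tanh(beta Y)."""
    return gauss_expect(
        lambda w: (_y(beta, w) - math.tanh(beta * _y(beta, w))) ** 2)

def covariance_term(beta):
    """Cov{(Y - X)^2, ln Z(beta|Y)}; note (Y - X)^2 = w^2 / beta."""
    joint = gauss_expect(lambda w: (w * w / beta) * log_z(beta, _y(beta, w)))
    return joint - (1.0 / beta) * gauss_expect(lambda w: log_z(beta, _y(beta, w)))

if __name__ == "__main__":
    b = 1.5
    print(mmse(b) + residual_power(b), 1.0 / b)   # both sides of eq. (18)
    print(residual_power(b), -covariance_term(b))  # both sides of eq. (19)
```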

It is now interesting to relate the noise suppression level

$$\Delta = E\|Y - E(X|Y)\|^2 = -\text{Cov}\{\|Y - X\|^2, \ln Z(\beta|Y)\}$$

to the Fisher information matrix and then to a new generalized notion of temperature due to
Narayanan and Srinivasa [21] via the de Bruijn identity. According to de Bruijn's identity,
if W is a vector of i.i.d. standard normal components, independent of X, then

$$\frac{d}{dt}h(X + \sqrt{t}W) = \frac{1}{2}\text{tr}\{J(X + \sqrt{t}W)\},$$

where $h(Y)$ is the differential entropy and $J(Y)$ is the Fisher information matrix associated
with Y w.r.t. a translation parameter, namely,

$$\text{tr}\{J(Y)\} = \sum_{i=1}^n E\left\{\left[\frac{\partial\ln P_\beta(y)}{\partial y_i}\bigg|_{y=Y}\right]^2\right\}.$$

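Note that since $P_\beta(y)$ and $Z(\beta|y)$ differ only by a multiplicative factor of $(\beta/2\pi)^{n/2}$, it is
obvious that $\partial\ln P_\beta(y)/\partial y_i = \partial\ln Z(\beta|y)/\partial y_i$ and so, the Fisher information can also be
related directly to the free energy by

$$\begin{aligned}
\text{tr}\{J(Y)\} &= \sum_{i=1}^n E\left\{\left[\frac{\partial\ln Z(\beta|y)}{\partial y_i}\bigg|_{y=Y}\right]^2\right\} \\
&= \sum_{i=1}^n E\left\{\left[E\{-\beta(Y_i - X_i)|Y\}\right]^2\right\} \\
&= \beta^2\sum_{i=1}^n E\{E^2(N_i|Y)\},
\end{aligned} \tag{20}$$

where $N_i = Y_i - X_i$ and where we have used the fact that the derivative of $\exp\{-\beta\|y - x\|^2/2\}$
w.r.t. $y_i$ is given by $-\beta(y_i - x_i)\cdot\exp\{-\beta\|y - x\|^2/2\}$. Now, as is also shown in [12]: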



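$$\begin{aligned}
I(X; X + N) &= I(X; X + W/\sqrt{\beta}) \\
&= h(X + W/\sqrt{\beta}) - h(W/\sqrt{\beta}) \\
&= h(X + W/\sqrt{\beta}) - \frac{n}{2}\ln(2\pi e/\beta).
\end{aligned} \tag{21}$$

Thus,

$$\begin{aligned}
\text{mmse}(X|X + N) &= 2\frac{\partial I(X; X + N)}{\partial\beta} \\
&= 2\frac{\partial h(X + W/\sqrt{\beta})}{\partial\beta} + \frac{n}{\beta} \\
&= -\frac{\text{tr}\{J(Y)\}}{\beta^2} + \frac{n}{\beta},
\end{aligned} \tag{22}$$

where the factor $-1/\beta^2$ in front of the Fisher information term accounts for the passage
from the variable t to the variable $\beta = 1/t$, as $dt/d\beta = -1/\beta^2$. Combining this with
the previously obtained relations, we see that the noise suppression level due to MMSE
estimation is given by

$$\Delta = \frac{\text{tr}\{J(Y)\}}{\beta^2}.$$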
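
For a scalar Gaussian input, every quantity in the chain above has a closed form, which allows a short numerical illustration of de Bruijn's identity and of the relation between the noise suppression level and the Fisher information. The Python sketch below assumes $n = 1$ and $X \sim \mathcal{N}(0, P_x)$; the function names and the finite-difference step are illustrative choices.

```python
import math

def entropy_noisy(px, t):
    """h(X + sqrt(t) W) in nats for scalar X ~ N(0, px), W ~ N(0, 1)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * (px + t))

def fisher_information(px, t):
    """Fisher information of N(0, px + t) w.r.t. a translation parameter."""
    return 1.0 / (px + t)

def noise_suppression(px, beta):
    """Delta = 1/beta - mmse, with mmse = px / (1 + beta * px)."""
    return 1.0 / beta - px / (1.0 + beta * px)

def dh_dt(px, t, h=1e-6):
    """Central finite-difference estimate of (d/dt) h(X + sqrt(t) W)."""
    return (entropy_noisy(px, t + h) - entropy_noisy(px, t - h)) / (2.0 * h)

if __name__ == "__main__":
    px, beta = 2.0, 3.0
    t = 1.0 / beta
    print(dh_dt(px, t), 0.5 * fisher_information(px, t))        # de Bruijn
    print(fisher_information(px, t), beta ** 2 * noise_suppression(px, beta))
```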



In [21, Theorem 3.1], a generalized definition of the inverse temperature is proposed, as 
the response of the entropy to small energy perturbations, using de Bruijn's identity. As a 
consequence of that definition, the generalized inverse temperature in [21] turns out to be 
proportional to the Fisher information of Y , and thus, in our setting, it is also proportional 
to $\beta^2\Delta$.³ It should be pointed out that whenever the system undergoes a phase transition
(as is the case with most of our forthcoming examples), then $\Delta$, and hence also the effective
temperature, may exhibit a non-smooth behavior, or even a discontinuity.

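Additional relationships can be obtained in analogy to certain relations in statistical
thermodynamics that were mentioned in Section 3. Consider again the chain of equalities
(11), but this time, instead of using the relation $E_\beta\{\|Y - X\|^2\} = n/\beta$ in the passage from
the second to the third line, we use the relation $\frac{1}{2}E_\beta\{\|Y - X\|^2\} = -E_\beta\{\frac{\partial}{\partial\beta}\ln Z(\beta|Y)\}$ in
conjunction with the identity (cf. eq. (14)):

$$E_\beta\left\{\frac{\partial\ln Z(\beta|Y)}{\partial\beta}\right\} = \frac{\partial}{\partial\beta}E_\beta\{\ln Z(\beta|Y)\} + \frac{1}{2}\text{Cov}\{\|Y - X\|^2, \ln Z(\beta|Y)\}, \tag{23}$$

to obtain

$$E_\beta\{\ln Z(\beta|Y)\} - \beta\frac{\partial}{\partial\beta}E_\beta\{\ln Z(\beta|Y)\} = \frac{\beta}{2}\text{Cov}\{\|Y - X\|^2, \ln Z(\beta|Y)\} - I(X;Y). \tag{24}$$

Thus, redefining the function $\psi(\beta)$ as

$$\psi(\beta) = -E_\beta\{\ln Z(\beta|Y)\}, \tag{25}$$

we obtain the following differential equation which is very similar to (7):

$$\beta\dot{\psi}(\beta) - \psi(\beta) = \Sigma(\beta), \tag{26}$$

where

$$\Sigma(\beta) = \frac{\beta}{2}\text{Cov}\{\|Y - X\|^2, \ln Z(\beta|Y)\} - I(X;Y). \tag{27}$$

Thus, the solution to this equation is precisely the same as (9), except that $S(\beta)$ is replaced
by $\Sigma(\beta)$ and the ground-state energy $E_0$ is redefined as

$$E_0 = \frac{1}{2}E_\beta\left\{\min_x\|Y - x\|^2\right\}.$$

Consequently, $\text{mmse}(X|Y) = 2\dot{\psi}(\beta)$, where, in analogy with (10),

$$\dot{\psi}(\beta) = E_0 - \int_\beta^\infty \frac{\Sigma(\hat{\beta})}{\hat{\beta}^2}\, d\hat{\beta} + \frac{\Sigma(\beta)}{\beta},$$

and one can easily identify the contributions of the free energy and the internal energy
(heat), as was done in Section 3.

To summarize, we see that the I-MMSE relation gives rise to essentially similar relations as
in statistical thermodynamics, except that the "effective entropy" $\Sigma(\beta)$ includes correction
terms that account for the fact that our ensemble corresponds to a posterior distribution
$P(x|y)$ and the fact that the distribution of Y depends on $\beta$.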



³ As is shown in [21], the generalized inverse temperature coincides with the ordinary inverse temperature
when Y is purely Gaussian with variance proportional to $1/\beta$, i.e., the ordinary Boltzmann distribution with
a quadratic Hamiltonian. In our setting, on the other hand, Y is given by a mixture of Gaussians whose
weights are independent of $\beta$. To avoid confusion, it is important to emphasize that the original parameter
$\beta$, in our setting, pertains to the Boltzmann form of the distribution of X given Y = y according to the
posterior $P(x|y)$, whereas the current discussion concerns the temperature associated with the (unconditional)
ensemble of Y = X + N.
5 Examples



In this section, we provide a few examples where we show how the asymptotic MMSE 
can be calculated by using the I-MMSE relation in conjunction with statistical-mechanical 
techniques for evaluating the mutual information, or the partition function pertaining to 
the posterior distribution. 

After the first example, of a Gaussian i.i.d. channel input, which is elementary, we turn to
explore three examples where the channel input is a randomly selected codebook vector from
a certain ensemble of codebooks that comply with a power constraint $\frac{1}{n}E\{\|X\|^2\} \le P_x$.

There could be various motivations for MMSE estimation when the desired signal is a
codeword: One example is that of a user that, in addition to its desired signal, also receives
a relatively strong interfering signal, which carries digital information (a codeword)
intended for other users, and which comes from a codebook whose rate exceeds the capacity
of this crosstalk channel between the interferer and our user, so that the user cannot fully
decode this interference. Nonetheless, our user would like to estimate it as accurately as
possible in order to subtract it and thereby perform interference cancellation.

In the first example of a code ensemble (Subsection 5.2), we deal with a simple ensemble
of block codes, and we demonstrate that the MMSE exhibits a phase transition at the value
of $\beta$ for which the channel capacity $C(\beta) = \frac{1}{2}\ln(1 + \beta P_x)$ agrees with the coding rate
R. The second ensemble (Subsection 5.3) consists of a hierarchical structure which is
suitable for the Gaussian broadcast channel. Here, we will observe two phase transitions,
one corresponding to the weak user and one to the strong user. The third ensemble
(Subsection 5.4) is also hierarchical, but in a different way: here the hierarchy corresponds
to that of a tree-structured code that works in two (or more) segments. In this case,
there could be either one phase transition or two, depending on the coding rates at the
two segments (see also [19]). Our last example is not related to coding applications, and
it is based on a very simple model of sparse signals which is motivated by compressed
sensing applications. Here we show that phase transitions can be present when the signal
components are strongly correlated.

The statistical-mechanical considerations in this section provide unique insight into the 
coding and estimation problems, in particular by examining the typical behavior of the 
geometry of the free energy. This is in fact related to the notion of joint typicality for 
proving coding theorems, but more concrete geometry is seen due to the special structures 
of the code ensembles. In some of the ensuing examples, the mutual information can also
be obtained through existing channel capacity results from information theory. In the last
example pertaining to sparse signals (Subsection 5.5), however, we are not aware of any
alternative to the calculation using statistical-mechanical techniques.

5.1 Gaussian I.I.D. Input 

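Our first example is very simple: Here, the components of X are zero-mean, i.i.d., Gaussian
RV's with variance $P_x$. In this case, we readily obtain

$$Z(\beta|y) = \frac{\exp\{-\|y\|^2/[2(P_x + 1/\beta)]\}}{(1 + \beta P_x)^{n/2}},$$

thus

$$\ln Z(\beta|y) = -\frac{n}{2}\ln(1 + \beta P_x) - \frac{\|y\|^2}{2(P_x + 1/\beta)}.$$

Clearly,

$$E_\beta\ln Z(\beta|Y) = -\frac{n}{2}\ln(1 + \beta P_x) - \frac{n}{2},$$

and its negative derivative is $nP_x/[2(1 + \beta P_x)]$, which is indeed half of the MMSE. Here,
we have:

$$\Delta = \frac{n}{\beta} - \frac{nP_x}{1 + \beta P_x} = \frac{n}{\beta(1 + \beta P_x)}$$

and

$$\text{tr}\{J(Y)\} = nE\left\{\left[\frac{Y}{P_x + 1/\beta}\right]^2\right\} = \frac{n\beta}{1 + \beta P_x},$$

where Y here stands for a single component of the output vector, and so, the relation
$\text{tr}\{J(Y)\} = \beta^2\Delta$ is easily verified. Thus, the generalized inverse temperature
here is $\beta/(1 + \beta P_x)$, which is the reciprocal of the variance of the Gaussian output.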

5.2 Random Codebook on a Sphere Surface 

Let X assume a uniform distribution over a codebook $\mathcal{C} = \{x_1, \ldots, x_M\}$, $M = e^{nR}$, where
each codeword $x_i$ is drawn independently under the uniform distribution over the surface of
the n-dimensional sphere, which is centered at the origin, and whose radius is $\sqrt{nP_x}$. The
code is capacity achieving (the input becomes essentially i.i.d. Gaussian as $n \to \infty$). In the
following we show that the MMSE vanishes if the code rate R is below channel capacity,
but is no different than that of i.i.d. Gaussian input (without code structure) if R exceeds
the capacity. We note that such a phase transition has been shown for good binary codes
in general in [25] using the I-MMSE relationship.
Here, for a given y, we have:

$$\begin{aligned}
Z(\beta|y) &= \sum_{x\in\mathcal{C}} e^{-nR}\exp[-\beta\|y - x\|^2/2] \\
&= e^{-nR}\exp[-\beta\|y - x_0\|^2/2] + e^{-nR}\sum_{x\in\mathcal{C}\setminus\{x_0\}}\exp[-\beta\|y - x\|^2/2] \\
&\triangleq Z_c(\beta|y) + Z_e(\beta|y),
\end{aligned} \tag{28}$$

where, without loss of generality, we assume $x_0$ to be the transmitted codeword. Now,
since $\|y - x_0\|^2$ is typically around $n/\beta$, $Z_c(\beta|y)$ would typically be about
$e^{-nR}e^{-\beta\cdot n/(2\beta)} = e^{-n(R+1/2)}$. As for $Z_e(\beta|y)$, we have:

$$Z_e(\beta|y) \doteq e^{-nR}\int d\epsilon\, N(\epsilon)e^{-\beta n\epsilon},$$

where $N(\epsilon)$ is the number of codewords $\{x\}$ in $\mathcal{C}\setminus\{x_0\}$ for which $\|y - x\|^2/2 \approx n\epsilon$, namely,
between $n\epsilon$ and $n(\epsilon + d\epsilon)$. Now, given y, $N(\epsilon) = \sum_{i=1}^{M} 1\{x_i: \|y - x_i\|^2/2 \approx n\epsilon\}$ is the sum
of M i.i.d. Bernoulli RV's and so, its expectation is

$$\bar{N}(\epsilon) = \sum_{i=1}^{M}\Pr\{\|y - X_i\|^2/2 \approx n\epsilon\} = e^{nR}\Pr\{\|y - X_1\|^2/2 \approx n\epsilon\}. \tag{29}$$

Denoting $P_y = \frac{1}{n}\sum_{i=1}^n y_i^2$ (typically, $P_y$ is about $P_x + 1/\beta$), the event $\|y - x\|^2/2 \approx n\epsilon$ is
equivalent to the event $\langle x, y\rangle \approx [(P_x + P_y)/2 - \epsilon]n$ or equivalently,

$$\rho(x, y) \triangleq \frac{\langle x, y\rangle}{n\sqrt{P_xP_y}} \approx \frac{(P_x + P_y)/2 - \epsilon}{\sqrt{P_xP_y}} \triangleq \frac{P_a - \epsilon}{P_g},$$

where we have defined $P_a = (P_x + P_y)/2$ and $P_g = \sqrt{P_xP_y}$ (the arithmetic and the geometric
means between $P_x$ and $P_y$, respectively). The probability that a randomly chosen vector X
on the sphere would have an empirical correlation coefficient $\rho$ with a given vector y (that
is, X falls within a cone of half angle $\arccos(\rho)$ around y) is exponentially $\exp[\frac{n}{2}\ln(1 - \rho^2)]$.
For convenience, let us define

$$\Gamma(\rho) = \frac{1}{2}\ln(1 - \rho^2)$$

so that we can write

$$\Pr\{\|y - X_1\|^2/2 \approx n\epsilon\} \doteq \exp\left\{n\Gamma\left(\frac{P_a - \epsilon}{P_g}\right)\right\}.$$
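
The exponential estimate $\exp[\frac{n}{2}\ln(1-\rho^2)]$ for the cone probability can be compared with an exact finite-n computation. The Python sketch below relies on the standard fact that the correlation coefficient between a uniformly distributed point on a sphere in $\mathbb{R}^n$ and a fixed direction has density proportional to $(1-t^2)^{(n-3)/2}$ on $[-1,1]$; the midpoint-rule integration and its step count are assumptions of this sketch. The normalized log-probability should approach $\frac{1}{2}\ln(1-\rho^2)$ as n grows.

```python
import math

def log_cone_probability(n, rho, steps=100_000):
    """ln Pr{correlation >= rho} for X uniform on a sphere in R^n (n >= 3)."""
    k = 0.5 * (n - 3)
    base = 1.0 - rho * rho
    # Numerator: integral over [rho, 1] of ((1 - t^2) / (1 - rho^2))^k dt,
    # normalized by its value at t = rho to avoid underflow.
    dt = (1.0 - rho) / steps
    num = sum(((1.0 - (rho + (j + 0.5) * dt) ** 2) / base) ** k
              for j in range(steps)) * dt
    # Denominator: integral over [-1, 1] of (1 - t^2)^k dt.
    du = 2.0 / steps
    den = sum((1.0 - (-1.0 + (j + 0.5) * du) ** 2) ** k
              for j in range(steps)) * du
    return k * math.log(base) + math.log(num) - math.log(den)

def gamma_exponent(rho):
    """The exponent (1/2) ln(1 - rho^2) used in the text."""
    return 0.5 * math.log(1.0 - rho * rho)

if __name__ == "__main__":
    for n in (100, 1000):
        print(n, log_cone_probability(n, 0.5) / n, gamma_exponent(0.5))
```

The gap between the two printed columns reflects sub-exponential factors, which are immaterial at the exponential scale used in the analysis.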



From this point and onward, our considerations are very similar to those that have been 
used in the random energy model (REM) of spin glasses in statistical mechanics [5-7], a 
model of disordered magnetic materials where the energy levels pertaining to the various 
configurations of the system $\{\mathcal{E}(x)\}$ are i.i.d. RV's. These considerations have already
been applied in the analogous analysis of random code ensemble performance, where the 
randomly chosen codewords give rise to random scores that play the same role as the random 
energies of the REM. The reader is referred to [27], [28], [20, Chapters 5,6], and [18] for a 
more detailed account of these ideas. 

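Applied to the random code ensemble considered here, the line of thought is as follows:
If $\epsilon$ is such that

$$R + \Gamma\left(\frac{P_a - \epsilon}{P_g}\right) > 0,$$

then the energy level $\epsilon$ will be typically populated with an exponential number of codewords,
concentrated very strongly around its mean

$$\bar{N}(\epsilon) \doteq \exp\left\{n\left[R + \Gamma\left(\frac{P_a - \epsilon}{P_g}\right)\right]\right\},$$

otherwise (which means that $\bar{N}(\epsilon)$ is exponentially small), the energy level $\epsilon$ will not be
populated by any codewords typically. This means that the populated energy levels range
between

$$\epsilon_1 = P_a - P_g\sqrt{1 - e^{-2R}}$$

and

$$\epsilon_2 = P_a + P_g\sqrt{1 - e^{-2R}},$$

or equivalently, the populated values of $\rho$ range between $-\rho^*$ and $+\rho^*$, where $\rho^* = \sqrt{1 - e^{-2R}}$.
By large deviations and saddle-point methods [4,11], it follows that for a typical realization
of the randomly chosen code, we have

$$\begin{aligned}
Z_e(\beta|y) &\doteq e^{-nR}\max_{\epsilon\in[\epsilon_1,\epsilon_2]}\exp\left\{n\left[R + \Gamma\left(\frac{P_a - \epsilon}{P_g}\right) - \beta\epsilon\right]\right\} \\
&= \max_{\epsilon\in[\epsilon_1,\epsilon_2]}\exp\left\{n\left[\Gamma\left(\frac{P_a - \epsilon}{P_g}\right) - \beta\epsilon\right]\right\} \\
&= \exp\left\{n\max_{|\rho|\le\rho^*}\left[\frac{1}{2}\ln(1 - \rho^2) - \beta(P_a - \rho P_g)\right]\right\}.
\end{aligned}$$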
\p\<p* I ^ 
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The derivative of i ln(l - p^) + pPPg w.r.t. p vanishes within [—1, 1] at: 



A 



P = pp = vTT^ - e 

where 

„ A 1 



This is the maximizer as long as \/l + — 6 < p^,, namely, 9 > e~'^^/2p^, or equivalently, 

13 < p^e'^^/Pg, which for Pg = y^PjP^^TTJP) , is equivalent to P < = (e^^ - 1)/Px. 
Thus, for the typical code we have 

6(3 R)^ hm ^"^^(^'^^ = /^^""(^ - - ^^^'^ - P^^s^^ ^ < 

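Taking now into account $Z_c(\beta|y)$, it is easy to see that for $\beta > \beta_R$ (which means $R < C$), $Z_c(\beta|y)$ dominates $Z_e(\beta|y)$, whereas for $\beta < \beta_R$ it is the other way around. It follows then that

$$\phi(\beta,R) \triangleq \lim_{n\to\infty}\frac{\ln Z(\beta|y)}{n} = \begin{cases} \frac{1}{2}\ln(1-\rho_\beta^2) - \beta(P_a - \rho_\beta P_g), & \beta < \beta_R \\ -R - \frac{1}{2}, & \beta \ge \beta_R. \end{cases}$$

On substituting $P_a = P_x + 1/(2\beta)$, $P_g = \sqrt{P_x(P_x + 1/\beta)}$ and

$$\rho_\beta = \sqrt{1+\theta^2} - \theta = \sqrt{\frac{\beta P_x}{1+\beta P_x}},$$

we then get:

$$\psi(\beta) = -\lim_{n\to\infty}\frac{\ln Z(\beta|y)}{n} = \begin{cases} \frac{1}{2}\ln(1+\beta P_x) + \frac{1}{2}, & \beta < \beta_R \\ R + \frac{1}{2}, & \beta \ge \beta_R. \end{cases}$$

Note that $\psi(\beta)$ is a continuous function but it is not smooth at $\beta = \beta_R$. Now,

$$\lim_{n\to\infty}\frac{\mathrm{mmse}(X|Y)}{n} = 2\,\frac{d\psi(\beta)}{d\beta} = \begin{cases} \frac{P_x}{1+\beta P_x}, & \beta < \beta_R \\ 0, & \beta > \beta_R, \end{cases} \qquad (30)$$

which means that there is a first-order phase transition* in the MMSE: As long as $\beta > \beta_R$, which means $R < C$, the MMSE essentially vanishes since the correct codeword can be reliably decoded, whereas for $R > C$, the MMSE behaves as if the inputs were i.i.d. Gaussian with variance $P_x$ (cf. Subsection 5.1).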
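The closed forms of this subsection can be checked numerically. The following is a minimal sketch, assuming the reading $\Gamma(\rho) = \frac{1}{2}\ln(1-\rho^2)$, $P_a = P_x + 1/(2\beta)$ and $P_g = \sqrt{P_x(P_x+1/\beta)}$ used above; the function names are illustrative and not taken from the paper. It evaluates the exponent through the maximizer $\rho_\beta$ and compares it with $\frac{1}{2}\ln(1+\beta P_x) + \frac{1}{2}$.

```python
import math


def gamma(rho):
    """Gamma(rho) = (1/2) ln(1 - rho^2), the exponent assumed in the text."""
    return 0.5 * math.log(1.0 - rho * rho)


def rho_beta(beta, Px=1.0):
    """Stationary point sqrt(1 + theta^2) - theta with theta = 1/(2 beta Pg)."""
    Pg = math.sqrt(Px * (Px + 1.0 / beta))
    theta = 1.0 / (2.0 * beta * Pg)
    return math.sqrt(1.0 + theta * theta) - theta


def beta_R(R, Px=1.0):
    """Critical inverse noise variance (e^{2R} - 1)/Px."""
    return (math.exp(2.0 * R) - 1.0) / Px


def psi(beta, R, Px=1.0):
    """-lim ln Z / n, assembled from the exponents of Z_e and Z_c."""
    if beta < beta_R(R, Px):
        Pa = Px + 1.0 / (2.0 * beta)
        Pg = math.sqrt(Px * (Px + 1.0 / beta))
        rho = rho_beta(beta, Px)
        return -(gamma(rho) - beta * (Pa - rho * Pg))
    return R + 0.5  # the correct codeword dominates


def mmse_per_sample(beta, R, Px=1.0):
    """Asymptotic MMSE per sample: Px/(1 + beta Px) below beta_R, zero above."""
    return Px / (1.0 + beta * Px) if beta < beta_R(R, Px) else 0.0
```

For instance, with $P_x = 1$ and $R = 1$ nat, `psi(2.0, 1.0)` agrees with $\frac{1}{2}\ln 3 + \frac{1}{2}$, while `mmse_per_sample` drops from about $P_x/(1+\beta_R P_x)$ to zero as $\beta$ crosses $\beta_R$.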



5.3 Hierarchical Code Ensemble for the Degraded Broadcast Channel 

Consider the following hierarchical code ensemble: First, randomly draw $M_1 = e^{nR_1}$ cloud-center vectors $\{u_i\}$ on the $\sqrt{n}$-sphere. Then, for each $u_i$, randomly draw $M_2 = e^{nR_2}$ codewords $\{x_{i,j}\}$ according to $x_{i,j} = a u_i + \sqrt{1-a^2}\,v_{i,j}$, where $\{v_{i,j}\}$ are randomly drawn uniformly and independently on the $\sqrt{n}$-sphere. This means that $\|x_{i,j} - a u_i\|^2 = n(1-a^2) \triangleq nb$. Without essential loss of generality, here and in Subsection 5.4 we take the channel input power to be $P_x = 1$.

*By "first-order phase transition", we mean, in this context, that the MMSE is a discontinuous function of $\beta$.

Let $x_{0,0}$, belonging to cloud center $u_0$, be the input to the Gaussian channel (1). It is easy to see that if the SNR of the Gaussian channel is high enough, the codeword $x_{i,j}$ can be decoded, while at a certain lower SNR only the cloud center $u_i$ can be decoded but not $v_{i,j}$. In the following we show the phase transitions of the MMSE as a function of the SNR.

We will decompose the partition function as follows:

$$Z(\beta|y) = e^{-nR}\sum_{i,j}\exp(-\beta\|y - x_{i,j}\|^2/2)$$
$$= e^{-nR}\exp(-\beta\|y - x_{0,0}\|^2/2) + e^{-nR}\sum_{j\ge 1}\exp(-\beta\|y - x_{0,j}\|^2/2) + e^{-nR}\sum_{i\ge 1}\sum_{j}\exp(-\beta\|y - x_{i,j}\|^2/2)$$
$$\triangleq Z_c(\beta|y) + Z_{e1}(\beta|y) + Z_{e2}(\beta|y) \qquad (31)$$

where once again, $Z_c(\beta|y)$, the contribution of the correct codeword, is typically about $e^{-n(R+1/2)}$. The other two terms $Z_{e1}(\beta|y)$ and $Z_{e2}(\beta|y)$ correspond to contributions of incorrect codewords from the same cloud and from other clouds, respectively.

Let us consider $Z_{e1}(\beta|y)$ first. The distance $\|y - x_{0,j}\|^2$ is decomposed as follows:

$$\|y - x_{0,j}\|^2 = \|(y - a u_0) + (a u_0 - x_{0,j})\|^2 = \|y - a u_0\|^2 + \|a u_0 - x_{0,j}\|^2 + 2\langle y - a u_0,\, a u_0 - x_{0,j}\rangle \qquad (32)$$

Now, $\|y - a u_0\|^2$ is typically about $n/\beta + nb = n(b + 1/\beta)$ and $\|a u_0 - x_{0,j}\|^2 = nb$. Thus, for $\|y - x_{0,j}\|^2/2$ to be around $ne$, $\langle y - a u_0, a u_0 - x_{0,j}\rangle$ must be around $n[e - (b + 1/\beta + b)/2] = n[e - P_a]$, where here $P_a = b + 1/(2\beta)$. Now, the question is this: Given $y - a u_0$, what is the typical number of codewords in cloud 0 for which $\langle y - a u_0, a u_0 - x_{0,j}\rangle = n[e - P_a]$? Similarly as before, the answer is the following:



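$$N(e) \doteq \begin{cases} \exp\left\{n\left[R_2 + \Gamma\left(\frac{e - P_a}{P_g}\right)\right]\right\}, & e \in [P_a - \rho_2 P_g,\ P_a + \rho_2 P_g] \\ 0, & \text{elsewhere} \end{cases} \qquad (33)$$

where $P_g \triangleq \sqrt{b(b + 1/\beta)}$ and $\rho_2 = \sqrt{1 - e^{-2R_2}}$. Thus,

$$Z_{e1}(\beta|y) \doteq e^{-nR}\exp\left\{n\max_{|\rho|\le\rho_2}\left[R_2 + \Gamma(\rho) - \beta(P_a - \rho P_g)\right]\right\} = e^{-nR_1}\exp\left\{n\left[\max_{|\rho|\le\rho_2}\left(\frac{1}{2}\ln(1-\rho^2) + \beta\rho P_g\right) - \beta P_a\right]\right\} \qquad (34)$$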



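As before, the derivative of $\left[\frac{1}{2}\ln(1-\rho^2) + \beta\rho P_g\right]$ w.r.t. $\rho$ vanishes within $[-1,1]$ at:

$$\rho = \rho_\beta \triangleq \sqrt{1+\theta^2} - \theta$$

where

$$\theta \triangleq \frac{1}{2\beta P_g}.$$

This is the maximizer as long as $\sqrt{1+\theta^2} - \theta < \rho_2$, namely, $\theta > e^{-2R_2}/(2\rho_2)$, or equivalently, $\beta < \rho_2 e^{2R_2}/P_g$, which for $P_g = \sqrt{b(b + 1/\beta)}$, is equivalent to $\beta < \beta(R_2) \triangleq (e^{2R_2} - 1)/b$. Thus, for the typical code we have

$$-\lim_{n\to\infty}\frac{\ln Z_{e1}(\beta|y)}{n} = \begin{cases} R_1 - \frac{1}{2}\ln(1-\rho_\beta^2) + \beta(P_a - \rho_\beta P_g), & \beta < \beta(R_2) \\ R + \beta(P_a - \rho_2 P_g), & \beta \ge \beta(R_2). \end{cases}$$

Similarly as before, it is easy to see that

$$Z_c + Z_{e1} \doteq \exp\left\{-n\left[R_1 + \min\left\{R_2, \frac{1}{2}\ln(1 + b\beta)\right\} + \frac{1}{2}\right]\right\}.$$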

Turning now to $Z_{e2}(\beta|y)$, we have the following consideration. Given $u_i$, $i \ge 1$, let $y' = y - a u_i$, and note that $x_{i,j} - a u_i = \sqrt{1-a^2}\,v_{i,j}$. We would like to estimate how many codewords in cloud $i$, $N_i(e)$, contribute $\|y - x_{i,j}\|^2/2 = \|y' - \sqrt{1-a^2}\,v_{i,j}\|^2/2 = ne$. Similarly as before, $N_i(e)$ is given by exactly the same formula as (33), where this time, $P_a = (1 - a^2 + \|y - a u_i\|^2/n)/2$ and $P_g = \sqrt{(1-a^2)\|y - a u_i\|^2/n}$. Thus, we have expressed the typical number of codewords that cloud $i$ contributes with energy $e$ as $N_i(e) \doteq \exp\{nF(\|y - a u_i\|^2/n, e)\}$, and the total number is $N(e) = \sum_i N_i(e)$. Now let $M(\delta)$ be the number of $\{u_i\}$ for which $\|y - a u_i\|^2/n = \delta$. Then,

$$N(e) = \sum_\delta M(\delta)\,e^{nF(\delta,e)}.$$



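Now,

$$M(\delta) \doteq \begin{cases} \exp\left\{n\left[R_1 + \Gamma\left(\frac{P_a' - \delta/2}{P_g'}\right)\right]\right\}, & \delta \in [\delta_1, \delta_2] \\ 0, & \text{elsewhere} \end{cases}$$

where $P_a' = (1 + 1/\beta + a^2)/2$, $P_g' = a\sqrt{1 + 1/\beta}$, $\delta_1 = 2(P_a' - P_g'\sqrt{1 - e^{-2R_1}}) \triangleq 2(P_a' - \rho_1 P_g')$ and $\delta_2 = 2(P_a' + \rho_1 P_g')$. Thus,

$$N(e) \doteq \exp\left\{n\max_{\delta_1\le\delta\le\delta_2}\left[R_1 + \Gamma\left(\frac{P_a' - \delta/2}{P_g'}\right) + F(\delta, e)\right]\right\}.$$

Putting it all together, we get:

$$\lim_{n\to\infty}\frac{\ln Z_{e2}(\beta|y)}{n} = \max_{|r_1|\le\rho_1}\ \max_{|r_2|\le\rho_2(r_1)}\left\{\frac{1}{2}\ln(1-r_1^2) + \frac{1}{2}\ln(1-r_2^2) - \beta\left[\frac{1-a^2}{2} + P_a' - r_1 P_g' - r_2\sqrt{2(1-a^2)(P_a' - r_1 P_g')}\right]\right\} \qquad (35)$$

where $\rho_1 = \sqrt{1 - e^{-2R_1}}$, $\rho_2(r_1) = \sqrt{1 - e^{-2R}/(1 - r_1^2)}$, $P_a' = (1 + 1/\beta + a^2)/2$, and $P_g' = a\sqrt{1 + 1/\beta}$. The above expression does not seem to lend itself to closed form analysis in an easy manner. Numerical results (cf. Fig. 1) show a reasonable match (within the order of magnitude of 1 x 10~^) between values of $\lim_{n\to\infty} I(X;Y)/n$ obtained numerically from the asymptotic exponent of $E\{\ln Z(\beta|Y)\}$ and those that are obtained from the expected behavior in this case:

$$\lim_{n\to\infty}\frac{I(X;Y)}{n} = \begin{cases} \frac{1}{2}\ln(1+\beta), & \beta < \beta_1 \\ R_1 + \frac{1}{2}\ln(1 + b\beta), & \beta_1 \le \beta < \beta_2 \\ R = R_1 + R_2, & \beta \ge \beta_2 \end{cases} \qquad (36)$$

Figure 1: Graph of $\lim_{n\to\infty} I(X;Y)/n = -\lim_{n\to\infty} E\{\ln Z(\beta|Y)\}/n - 1/2$ as a function of $\beta$ for $R_1 = 0.1$, $R_2 = 0.6206$, and $a = 0.7129$, which result in $\beta_1 = 0.5545$ and $\beta_2 = 5.001$. As can be seen quite clearly, there are phase transitions at these values of $\beta$.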



where

$$\beta_1 \triangleq \frac{e^{2R_1} - 1}{1 - b\,e^{2R_1}}, \qquad \beta_2 \triangleq \frac{e^{2R_2} - 1}{b},$$

and it is assumed that the parameters of the model ($R_1$, $R_2$ and $a$) are chosen such that $\beta_1 < \beta_2$. Accordingly, the MMSE undergoes two phase transitions, where it behaves as if the input was: (i) Gaussian i.i.d. with unit variance for $\beta < \beta_1$ (where no information can be decoded), (ii) Gaussian input of a smaller variance (corresponding to the cloud), in the intermediate range (where the cloud center is decodable, but the refined message is not), and (iii) the MMSE altogether vanishes for $\beta > \beta_2$, where both messages are reliably decodable.
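The two thresholds are easy to evaluate for the parameters quoted in the caption of Fig. 1. The sketch below assumes $b = 1 - a^2$ and the expressions $\beta_1 = (e^{2R_1}-1)/(1 - b\,e^{2R_1})$ and $\beta_2 = (e^{2R_2}-1)/b$ given above; the function names are illustrative, and the error raised when $b\,e^{2R_1} \ge 1$ is an added guard, not something stated in the text.

```python
import math


def thresholds(R1, R2, a):
    """Return (beta_1, beta_2) for the hierarchical ensemble, with b = 1 - a^2."""
    b = 1.0 - a * a
    denom = 1.0 - b * math.exp(2.0 * R1)
    if denom <= 0.0:
        # Assumed guard: for these parameters the cloud center is never decodable.
        raise ValueError("b * exp(2 R1) must be smaller than 1")
    beta1 = (math.exp(2.0 * R1) - 1.0) / denom
    beta2 = (math.exp(2.0 * R2) - 1.0) / b
    return beta1, beta2


def mutual_info_rate(beta, R1, R2, a):
    """Piecewise expression for lim I(X;Y)/n in nats, assuming beta_1 < beta_2."""
    b = 1.0 - a * a
    beta1, beta2 = thresholds(R1, R2, a)
    if beta < beta1:
        return 0.5 * math.log1p(beta)
    if beta < beta2:
        return R1 + 0.5 * math.log1p(b * beta)
    return R1 + R2
```

With $R_1 = 0.1$, $R_2 = 0.6206$ and $a = 0.7129$, `thresholds` returns values close to the quoted $\beta_1 = 0.5545$ and $\beta_2 = 5.001$, and `mutual_info_rate` is continuous across both thresholds while its slope is not.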

The hierarchical code ensemble takes the superposition code structure which achieves the capacity region of the Gaussian broadcast channel. Consider two receivers, referred to as receiver 1 and receiver 2, with $\beta_1$ and $\beta_2$ respectively. Receiver 1 can decode the cloud center, whereas receiver 2 can decode the entire codeword. In other words, suppose the hierarchical code ensemble with rate pair $(R_1, R_2)$ and parameter $a$ is sent to two receivers with fixed SNR of $\gamma_1$ and $\gamma_2$ respectively. Then the minimum decoding error probability vanishes as long as $(R_1, R_2, a)$ are such that

$$R_1 < \frac{1}{2}\log\left(1 + \frac{a^2\gamma_1}{1 + (1-a^2)\gamma_1}\right),$$
$$R_2 < \frac{1}{2}\log\left(1 + (1-a^2)\gamma_2\right). \qquad (37)$$

In particular, all boundary points of the capacity region can be achieved by varying the power distribution coefficient $a$. This capacity region result also leads to the fact that if only the cloud center is decodable, then the MMSE for the codeword $v_{i,j}$ is no different from what it would be if the elements of $v_{i,j}$ were i.i.d. standard Gaussian. Knowledge of the codebook structure of $\{v_{i,j}\}$ does not reduce the MMSE because otherwise the code cannot achieve the capacity region of the Gaussian broadcast channel.



5.4 Hierarchical Tree-Structured Code 

Consider next a hierarchical code with the following structure: The block of length $n$ is partitioned into two segments, the first is of length $n_1 = \lambda_1 n$ ($\lambda_1 \in (0,1)$) and the second is of length $n_2 = \lambda_2 n$ ($\lambda_2 = 1 - \lambda_1$). We randomly draw $M_1 = e^{n_1 R_1}$ first-segment codewords $\{x_i\}$ on the surface of the $\sqrt{n_1}$-sphere, and then, for each $x_i$, we randomly draw $M_2 = e^{n_2 R_2}$ second-segment codewords $\{x'_{i,j}\}$ on the surface of the $\sqrt{n_2}$-sphere. The total message of length $nR = n_1 R_1 + n_2 R_2$ (thus $R = \lambda_1 R_1 + \lambda_2 R_2$) is encoded in two parts: The first-segment codeword depends only on the first $n_1 R_1$ bits of the message whereas the second-segment codeword depends on the entire message.

Let $(x_0, x'_{0,0})$ be the transmitted codeword, and let $y$ and $y'$ be the corresponding segments of the channel output vector $(y, y')$. The partition function is as follows:

$$Z(\beta|y) = e^{-nR}\exp\{-\beta[\|y - x_0\|^2 + \|y' - x'_{0,0}\|^2]/2\}$$
$$+\ e^{-nR}\exp\{-\beta\|y - x_0\|^2/2\}\sum_{j\ge 1}\exp\{-\beta\|y' - x'_{0,j}\|^2/2\}$$
$$+\ e^{-nR}\sum_{i\ge 1}\sum_{j}\exp\{-\beta\|y - x_i\|^2/2\}\exp\{-\beta\|y' - x'_{i,j}\|^2/2\}$$
$$\triangleq Z_c + Z_{e1} + Z_{e2}. \qquad (38)$$



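Now, as before, $Z_c \doteq e^{-n(R+1/2)}$. As for $Z_{e1}$, it can also be treated as in Subsection 5.2. The first factor contributes $e^{-n_1 R_1}\cdot e^{-n_1/2}$. The second factor is $e^{-n_2[\min\{R_2, C(\beta)\} + 1/2]}$, where $C(\beta) = \frac{1}{2}\ln(1+\beta)$. Thus,

$$Z_{e1}(\beta|y) + Z_c \doteq \exp\left\{-n\left[\lambda_1 R_1 + \lambda_2\min\{R_2, C(\beta)\} + \frac{1}{2}\right]\right\}.$$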

Consider next the term $Z_{e2}$. Let $r_1 = \langle x, y\rangle/(n_1 P_g)$ and $r_2 = \langle x', y'\rangle/(n_2 P_g)$ where $P_g$ is as in Subsection 5.2. Of course, $\langle(x,x'),(y,y')\rangle/(nP_g) = \lambda_1 r_1 + \lambda_2 r_2$. What is the typical number of codewords $(x_i, x'_{i,j})$ of $Z_{e2}$ whose correlation with $(y,y')$ is exactly $r$? The answer is

$$\lim_{n\to\infty}\frac{\ln N(r)}{n} = \max_{|r_1|\le\rho(R_1)}\left\{\lambda_1 R_1 + \lambda_1\Gamma(r_1) + \lambda_2 R_2 + \lambda_2\Gamma\left(\frac{r - \lambda_1 r_1}{\lambda_2}\right)\right\}$$

where $\rho(x) = \sqrt{1 - e^{-2x}}$. This expression behaves differently depending on whether $R_1 > R_2$ or $R_1 < R_2$. In the first case, it behaves exactly as in the ordinary ensemble, that is:

$$\lim_{n\to\infty}\frac{\ln N(r)}{n} = \begin{cases} R + \frac{1}{2}\ln(1 - r^2), & |r| \le \rho(R) \\ 0, & |r| > \rho(R), \end{cases}$$

and then, of course, $Z_{e2} + Z_c$ is as before:

$$Z_{e2} + Z_c \doteq \exp\{-n[\min\{R, C(\beta)\} + 1/2]\}.$$

When $R_1 < R_2$, however, we have two phase transitions:



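$$\lim_{n\to\infty}\frac{\ln N(r)}{n} = \begin{cases} R + \Gamma(r), & |r| \le \rho(R_1) \\ \lambda_2\left[R_2 + \Gamma\left(\frac{r - \lambda_1\rho(R_1)}{\lambda_2}\right)\right], & \rho(R_1) < |r| \le \lambda_1\rho(R_1) + \lambda_2\rho(R_2) \\ 0, & |r| > \lambda_1\rho(R_1) + \lambda_2\rho(R_2). \end{cases}$$

In this case, we get:

$$\lim_{n\to\infty}\frac{\ln(Z_{e2} + Z_c)}{n} = \begin{cases} -C(\beta) - \frac{1}{2}, & \beta < \beta(R_1) \\ -\lambda_1 R_1 - \lambda_2 C(\beta) - \frac{1}{2}, & \beta(R_1) \le \beta < \beta(R_2) \\ -R - \frac{1}{2}, & \beta \ge \beta(R_2) \end{cases}$$

where $\beta(R)$ is the solution $\beta$ to the equation $C(\beta) \triangleq \frac{1}{2}\ln(1+\beta) = R$. To summarize, we have the following: $Z_c \doteq e^{-n(R+1/2)}$, $Z_{e1} + Z_c \doteq \exp\{-n[\lambda_1 R_1 + \lambda_2\min\{R_2, C(\beta)\} + 1/2]\}$ and

$$Z_{e2} + Z_c \doteq \begin{cases} \exp\{-n[\min\{R, C(\beta)\} + 1/2]\}, & R_1 > R_2 \\ \exp\{-n[\lambda_1\min\{R_1, C(\beta)\} + \lambda_2\min\{R_2, C(\beta)\} + 1/2]\}, & R_1 < R_2. \end{cases}$$

Clearly, if $R_1 < R_2$ then $Z_{e2} + Z_c$ dominates $Z_{e1} + Z_c$. If $R_1 > R_2$, we note that

$$\min\{\lambda_1 R_1 + \lambda_2\min\{R_2, C(\beta)\},\ \min\{R, C(\beta)\}\} = \min\{R, C(\beta)\}.$$

Thus,

$$Z(\beta|y) \doteq \begin{cases} \exp\{-n[\min\{R, C(\beta)\} + 1/2]\}, & R_1 > R_2 \\ \exp\{-n[\lambda_1\min\{R_1, C(\beta)\} + \lambda_2\min\{R_2, C(\beta)\} + 1/2]\}, & R_1 < R_2. \end{cases}$$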

The MMSE then is as in (30) in Subsection 5.2 when $R_1 > R_2$, and given by

$$\lim_{n\to\infty}\frac{\mathrm{mmse}(X|Y)}{n} = \begin{cases} \frac{1}{1+\beta}, & \beta < \beta(R_1) \\ \frac{\lambda_2}{1+\beta}, & \beta(R_1) < \beta < \beta(R_2) \\ 0, & \beta > \beta(R_2) \end{cases} \qquad (39)$$

when $R_1 < R_2$. This dichotomy between the two types of behavior has its roots in the behavior of the GREM, a generalized version of the random energy model, where the random energy levels of the various system configurations are correlated (rather than being i.i.d.) in a hierarchical structure [8-10]. The GREM turns out to have an intimate analogy with the tree-structured code ensemble considered here. The reader is referred to [19] for a more elaborate discussion on this topic.
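The relation between the exponent of $Z(\beta|y)$ and the MMSE in (39) can be checked numerically. The sketch below assumes $P_x = 1$, $C(\beta) = \frac{1}{2}\ln(1+\beta)$ and $\beta(R) = e^{2R} - 1$, and uses the rule that the MMSE per sample is twice the derivative of the exponent w.r.t. $\beta$; the function names are illustrative. The case $R_1 = R_2$, which the text does not treat separately, is folded into the first branch because both expressions coincide there.

```python
import math


def capacity(beta):
    """C(beta) = (1/2) ln(1 + beta), in nats."""
    return 0.5 * math.log1p(beta)


def beta_of(R):
    """Solution of C(beta) = R."""
    return math.exp(2.0 * R) - 1.0


def exponent(beta, R1, R2, lam1):
    """-lim ln Z / n for the tree-structured ensemble."""
    lam2 = 1.0 - lam1
    R = lam1 * R1 + lam2 * R2
    c = capacity(beta)
    if R1 >= R2:
        return min(R, c) + 0.5
    return lam1 * min(R1, c) + lam2 * min(R2, c) + 0.5


def mmse_per_sample(beta, R1, R2, lam1):
    """Asymptotic MMSE per sample: (30) with Px = 1 if R1 >= R2, else (39)."""
    lam2 = 1.0 - lam1
    if R1 >= R2:
        R = lam1 * R1 + lam2 * R2
        return 1.0 / (1.0 + beta) if beta < beta_of(R) else 0.0
    if beta < beta_of(R1):
        return 1.0 / (1.0 + beta)
    if beta < beta_of(R2):
        return lam2 / (1.0 + beta)
    return 0.0
```

Away from the thresholds, a central finite difference of `exponent`, multiplied by two, reproduces `mmse_per_sample`; for $R_1 < R_2$ the MMSE drops in two steps, first by the factor $\lambda_2$ and then to zero.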

The preceding result on the MMSE is consistent with the analysis based solely on information theoretic considerations. In case $R_1 < R_2$, the first-segment code is decodable as long as $R_1 < (1/2)\log(1+\beta)$, whereas the second-segment code is decodable if also $R_2 < (1/2)\log(1+\beta)$. Hence the MMSE is given by (39). In case $R_1 > R_2$, the second-segment code is decodable if and only if the first-segment code is also decodable, i.e., the two codes can be decoded jointly. This requires $R_2 < (1/2)\log(1+\beta)$, $\lambda_1 R_1 < (\lambda_1/2)\log(1+\beta) + (\lambda_2/2)\log(1+\beta)$ and $R = \lambda_1 R_1 + \lambda_2 R_2 < (1/2)\log(1+\beta)$. The last inequality dominates, hence the MMSE is given by (30).

5.5 Estimation of Sparse Signals



Let the components of $X$ be given by $X_i = S_i U_i$, $i = 1,2,\ldots,n$, where $S_i \in \{0,1\}$ and $\{U_i\}$ are $\mathcal{N}(0,\sigma^2)$ i.i.d. and independent of $\{S_i\}$. As before $Y = X + N$, where the components of $N$ are i.i.d. Gaussian $\mathcal{N}(0, 1/\beta)$. One motivation of this simple model is in compressed sensing applications, where the signal $X$ (possibly, in some transform domain) is assumed to possess a limited fraction of non-zero components, here designated by the non-zero components of $S = (S_1, S_2, \ldots, S_n)$. The signal $X$ is considered sparse if the relative fraction of 1's in $S$ is small. We will assume that $S$, whose realization is not revealed to the estimator, is governed by a given probability distribution $P(s)$. We first derive an expression of the partition function for a general $P(s)$ and then particularize our study to a certain form of $P(s)$. First, we have the following:

$$P(x) = \sum_s P(s)P(x|s)$$
$$= \sum_s P(s)\prod_{i:\,s_i=0}\delta(x_i)\prod_{i:\,s_i=1}\left[(2\pi\sigma^2)^{-1/2}\exp\{-x_i^2/(2\sigma^2)\}\right]$$
$$= \sum_s P(s)\prod_{i=1}^n\left[(2\pi s_i\sigma^2)^{-1/2}\exp\{-x_i^2/(2s_i\sigma^2)\}\right] \qquad (40)$$

where a zero-variance Gaussian distribution is understood to be equivalent to the Dirac delta-function. Thus,



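$$Z(\beta|y) = \int dx\,P(x)\exp\{-\beta\|y - x\|^2/2\}$$
$$= \sum_s P(s)\prod_{i=1}^n\int dx_i\,(2\pi s_i\sigma^2)^{-1/2}\exp\{-x_i^2/(2s_i\sigma^2)\}\cdot\exp\{-\beta(y_i - x_i)^2/2\}$$
$$= \sum_s P(s)\prod_{i=1}^n(1 + qs_i)^{-1/2}\exp\left\{-\frac{\beta y_i^2}{2(1 + qs_i)}\right\}$$
$$= \sum_s P(s)\prod_{i=1}^n\exp\left\{-\frac{1}{2}\left[\frac{\beta y_i^2}{1 + qs_i} + \ln(1 + qs_i)\right]\right\}$$

where we have used the notation* $q \triangleq \beta\sigma^2$. Transforming $s$ to "spins" $\mu = (\mu_1, \ldots, \mu_n)$ by the relation $\mu_i = 1 - 2s_i \in \{-1,+1\}$, we get:

$$\frac{\beta y_i^2}{1 + qs_i} + \ln(1 + qs_i) = \frac{\beta(1 + q/2)y_i^2}{1 + q} + \frac{1}{2}\ln(1 + q) - 2h_i\mu_i \qquad (41)$$

where

$$h_i \triangleq -\frac{\beta^2\sigma^2 y_i^2}{4(1 + \beta\sigma^2)} + \frac{\ln(1 + \beta\sigma^2)}{4}. \qquad (42)$$

On substituting back into the partition function we get:

$$Z(\beta|y) = (1 + q)^{-n/4}\cdot\exp\left\{-\frac{\beta(1 + q/2)}{2(1 + q)}\sum_{i=1}^n y_i^2\right\}\cdot\sum_\mu P(\mu)\exp\left\{\sum_{i=1}^n\mu_i h_i\right\} \qquad (43)$$

*The quantity $q$ is proportional to the SNR.

Thus $h_i$ is given the statistical-mechanical interpretation of the random 'local' magnetic field felt by the $i$-th spin.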
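The passage from the sum over $s$ to the spin representation is an exact identity for every $n$, so it can be verified by brute force on a short block. The sketch below is an illustration under stated assumptions: it takes the local field as $h_i = \frac{1}{4}\ln(1+q) - \beta q y_i^2/(4(1+q))$ with $q = \beta\sigma^2$, which is the form implied by matching the two sides of (41), and it uses an arbitrary non-uniform prior on $\{0,1\}^n$ (the helper `example_prior` is hypothetical and serves only the check).

```python
import itertools
import math


def local_field(y_i, beta, sigma2):
    """Local field h_i, with q = beta * sigma2 (assumed form, see lead-in)."""
    q = beta * sigma2
    return 0.25 * math.log1p(q) - beta * q * y_i * y_i / (4.0 * (1.0 + q))


def example_prior(n):
    """Hypothetical non-uniform prior on {0,1}^n, for illustration only."""
    states = list(itertools.product((0, 1), repeat=n))
    weights = [1.0 + 2.0 * sum(s) for s in states]
    total = sum(weights)
    return {s: w / total for s, w in zip(states, weights)}


def z_direct(y, beta, sigma2, prior):
    """Partition function as a sum over support vectors s."""
    q = beta * sigma2
    z = 0.0
    for s, p in prior.items():
        term = p
        for s_i, y_i in zip(s, y):
            term *= (1.0 + q * s_i) ** -0.5
            term *= math.exp(-beta * y_i * y_i / (2.0 * (1.0 + q * s_i)))
        z += term
    return z


def z_spin(y, beta, sigma2, prior):
    """Same quantity in the spin representation, with mu_i = 1 - 2 s_i."""
    q = beta * sigma2
    n = len(y)
    energy = sum(y_i * y_i for y_i in y)
    prefactor = (1.0 + q) ** (-n / 4.0)
    prefactor *= math.exp(-beta * (1.0 + q / 2.0) * energy / (2.0 * (1.0 + q)))
    total = 0.0
    for s, p in prior.items():
        field_sum = sum((1 - 2 * s_i) * local_field(y_i, beta, sigma2)
                        for s_i, y_i in zip(s, y))
        total += p * math.exp(field_sum)
    return prefactor * total
```

For any observation vector, `z_direct` and `z_spin` agree up to rounding. A large $|y_i|$ makes $h_i$ negative, which favors $\mu_i = -1$, that is, an active component $s_i = 1$.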

Eq. (43) holds for a general distribution $P(s)$ or equivalently, $P(\mu)$. To further develop this expression, we must make some assumptions on one of these distributions. At this point, we have the freedom to examine certain models of $P(\mu)$, and by viewing the expression $\sum_\mu P(\mu)\exp\{\sum_i\mu_i h_i\}$ as the partition function of a certain spin system with a non-uniform, random field $\{H_i\}$ (whose realization is $\{h_i\}$), we can borrow techniques from statistical physics to analyze its behavior. Evidently, for every spin glass model that exhibits phase transitions, it is conceivable that there will be analogous phase transitions in the corresponding signal estimation problem.

Assuming certain symmetry properties among the various components of $s$, it would be plausible to postulate that all $\{s\}$ with the same number of 1's are equally likely, or equivalently, all spin configurations $\{\mu\}$ with the same magnetization

$$m(\mu) \triangleq \frac{1}{n}\sum_{i=1}^n\mu_i$$

have the same probability. This means that $P(\mu)$ depends on $\mu$ only via $m(\mu)$. Consider then the form

$$P(\mu) = C_n\exp\{nf(m(\mu))\},$$

where $f(m)$ is an arbitrary function and $C_n$ is a normalization constant. Further, let us assume that $f$ is twice differentiable with finite first derivative on $[-1,1]$. Clearly,

$$C_n = \left(\sum_\mu\exp\{nf(m(\mu))\}\right)^{-1} \doteq \exp\left\{-n\max_m\left[\mathcal{H}_2((1+m)/2) + f(m)\right]\right\} = \exp\left\{-n\left(\mathcal{H}_2((1+m_a)/2) + f(m_a)\right)\right\} \qquad (44)$$

where $\mathcal{H}_2(\cdot)$ denotes the binary entropy function and $m_a$ is the maximizer of $\mathcal{H}_2((1+m)/2) + f(m)$. In other words, $m_a$ is the a-priori magnetization, namely the magnetization that dominates $P(\mu)$. Of course, when $f(m)$ is linear in $m$, the components of $\mu$ are i.i.d. Note that if $f$ is monotonically increasing in $m$, then $P(\mu)$ has a sharp peak at $m = 1$, which corresponds to a vanishing fraction of sites with $S_i = 1$, i.e., a sparse signal. Our derivation, however, will take place for general $f$.



5.5.1 General Solution 



On substituting the above expression of $P(\mu)$ into that of $Z(\beta|y)$, our main concern is then how to deal with the expression

$$Z(\beta|h) \triangleq \sum_\mu P(\mu)\,e^{\sum_i\mu_i h_i} = C_n\sum_\mu\exp\left\{n\left[f(m(\mu)) + \frac{1}{n}\sum_{i=1}^n\mu_i h_i\right]\right\} \qquad (45)$$

We investigate the typical behavior of the partition function, or more precisely, calculate the following quantity:

$$\frac{1}{n}\log E\{Z(\beta|H)\} = \frac{1}{n}\log\left[C_n\,E\left\{\sum_\mu\exp\left\{n\left[f(m(\mu)) + \frac{1}{n}\sum_{i=1}^n\mu_i H_i\right]\right\}\right\}\right] \qquad (46)$$

where $H$ consists of i.i.d. random variables with arbitrary distribution $p(H)$.

Using large deviations theory, as $n \to \infty$, the dominant value of $m$ in (46), henceforth denoted as $m^*$, is shown to satisfy

$$m^* = E\{\tanh(f'(m^*) + H)\} \qquad (47)$$

and

$$E\{\tanh^2(f'(m^*) + H)\} > 1 - \frac{1}{f''(m^*)}. \qquad (48)$$

The detailed analysis is relegated to Appendix A. Clearly, $m^*$ is the dominant magnetization a-posteriori, i.e., the one that dominates the posterior of $m(\mu)$ given (a typical) $y$. It is also shown in Appendix A that

$$\lim_{n\to\infty}\frac{1}{n}\log E\{Z(\beta|H)\} = \lim_{n\to\infty}\frac{1}{n}\log C_n - \psi(m^*) \qquad (49)$$

where

$$\psi(m^*) = f'(m^*)\,m^* - f(m^*) - E\{\log[2\cosh(f'(m^*) + H)]\} \qquad (50)$$

and the normalized exponent of $C_n$ is given by (44). Thus the asymptotic normalized mutual information is expressed as

$$\lim_{n\to\infty}\frac{I(X;Y)}{n} = -\frac{1}{2} + \frac{1}{4}\ln(1+q) + \frac{\beta(1+q/2)}{2(1+q)}\lim_{n\to\infty}\frac{E\|Y\|^2}{n} + \mathcal{H}_2\left(\frac{1+m_a}{2}\right) + f(m_a) + \psi(m^*). \qquad (51)$$

For the sparse signal model described by (40), $H$ is defined by (42) with $y_i$ replaced by $Y$ and the expectation over $Y$ is w.r.t. a mixture of two Gaussians: $\mathcal{N}(0, 1/\beta)$ with weight $(1+m_a)/2$, and $\mathcal{N}(0, \sigma^2 + 1/\beta)$ with weight $(1-m_a)/2$.

The solution to

$$E\{\tanh^2(f'(m) + H)\} = 1 - \frac{1}{f''(m)} \qquad (52)$$

is known as a critical point, beyond which the solution to (47) ceases to be a local maximum and it becomes a local minimum. The dominant $m^*$ must jump elsewhere. Also, as we vary one of the other parameters of the model, it might happen that the global maximum jumps from one local maximum to another.

5.5.2 Special Case with Quadratic Exponent 

Consider the case where $f$ is quadratic* in $m$, i.e.,

$$f(m) = am + bm^2/2. \qquad (53)$$

This is similar though not identical to the random-field Curie-Weiss model (RFCW model) of spin systems** (cf. e.g., [2] and references therein). Eq. (47) becomes

$$m = E\{\tanh(bm + a + H)\},$$

similarly as in the mean field model with a random field [2]. Eq. (52) for the critical point becomes

$$E\{\tanh^2(bm + a + H)\} = 1 - (1/b). \qquad (54)$$

*A quadratic model can be thought of as consisting of the first few terms of the Taylor expansion of a smooth function $f$.

**There is a certain difference in the sense that in the RFCW $\{H_i\}$ are i.i.d., whereas here each $H_i$ depends on the corresponding $\mu_i$ because the variance of $y_i$ depends on whether $\mu_i = -1$ or $\mu_i = +1$. Also as a result, $\{H_i\}$ here are not i.i.d. because they depend on each other via the dependence between $\{\mu_i\}$. These differences are not crucial, however.

To demonstrate that the global maximum might jump from one local maximum to another, consider the quadratic case and assume that $\beta$ and $\sigma^2$ are so small that the fluctuations in $H$ can be neglected. Equation (47) can then be approximated by

$$m = \tanh(bm + a),$$

which is actually the same as the equation of the magnetization in the Curie-Weiss model (a.k.a. the mean field model or the infinite-range model) of spin arrays (cf. e.g., [22, Sect. 4.2], [1, Chap. 3], [14, Sect. 4.5.1]), which is actually a special case of the above with $H_i = 0$ for all $i$. For $a = 0$ and $b > 1$, this equation has two symmetric non-zero solutions $\pm m_0$, which both dominate the partition function. If $a \ne 0$ but small, then the symmetry is broken, and there is only one dominant solution which is about $m_0\,\mathrm{sgn}(a)$. To approximate $m_0$ for the case where $|a|$ is small and $b$ is only slightly larger than 1, one can use the Taylor expansion of the function $\tanh(\cdot)$ (as is customarily done in the theory of the infinite range Ising model; see e.g., [22, p. 188, eqs. (4.21a), (4.21b)]) and get

$$m \approx bm + a - \frac{(bm + a)^3}{3}.$$

Neglecting the contribution of $a$, and discarding the trivial solution $m = 0$, we get a simple quadratic equation whose solutions are $\pm m_0$ with $m_0 = \frac{1}{b}\sqrt{3(1 - 1/b)}$. Thus, for small values of $|a|$ and $b - 1$,

$$m^* \approx m_0\cdot\mathrm{sgn}(a),$$

and so, $m^*$ jumps between $+m_0$ and $-m_0$ as $a$ crosses the origin. Similarly, for $a = 0$, $m^*$ jumps from zero to $+m_0$ or $-m_0$ as $b$ passes the value $b = 1$ while increasing.
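The behavior of the solutions of $m = \tanh(bm + a)$ can be illustrated with a plain fixed-point iteration. This is only a sketch: the iteration, its starting points and its stopping rule are choices made here and are not part of the text, and it finds the stable solution reachable from the given starting point rather than comparing free energies.

```python
import math


def solve_magnetization(a, b, m_init, max_iter=100000, tol=1e-14):
    """Iterate m <- tanh(b*m + a) from m_init until successive values agree."""
    m = m_init
    for _ in range(max_iter):
        m_new = math.tanh(b * m + a)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m


def m0_taylor(b):
    """Cubic-Taylor approximation (1/b) * sqrt(3 * (1 - 1/b)), for b slightly above 1."""
    return math.sqrt(3.0 * (1.0 - 1.0 / b)) / b
```

For $b = 1.05$ and $a = 0$, starting from $m = 0.5$ the iteration settles near $0.37$, while `m0_taylor(1.05)` gives about $0.36$, so the cubic approximation is already reasonable this close to $b = 1$. Starting from $m = 0$ with a small field $a = \pm 10^{-3}$ selects the solution whose sign agrees with that of $a$, and for $b < 1$ with $a = 0$ the iteration collapses to $m = 0$.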

By (51), the asymptotic normalized mutual information of this model is given by

$$\lim_{n\to\infty}\frac{I(X;Y)}{n} = -\frac{1}{2} + \frac{1}{4}\ln(1+q) + \frac{\beta(1+q/2)}{2(1+q)}\left[\frac{1+m_a}{2}\cdot\frac{1}{\beta} + \frac{1-m_a}{2}\left(\sigma^2 + \frac{1}{\beta}\right)\right] + \mathcal{H}_2\left(\frac{1+m_a}{2}\right) + f(m_a) + \psi(m^*)$$
$$= -\frac{1}{2} + \frac{1}{4}\ln(1+q) + \frac{1+q/2}{2(1+q)}\left(1 + \frac{1-m_a}{2}\,q\right) + \mathcal{H}_2\left(\frac{1+m_a}{2}\right) + am_a + \frac{bm_a^2}{2} - E\{\ln[2\cosh(bm^* + a + H)]\} + \frac{b(m^*)^2}{2}. \qquad (55)$$

In this special case of quadratic exponent, the Hubbard-Stratonovich transformation can be used to obtain an alternative, more straightforward derivation of the mutual information result (55). The details are provided in Appendix B.
The MMSE is equal to twice the derivative of (55) w.r.t. $\beta$. Note that the dominant value $m^*$ is dependent on $\beta$. In Appendix C we carry out the calculation and obtain

$$\lim_{n\to\infty}\frac{\mathrm{mmse}(X|Y)}{n} = \frac{\sigma^2 q}{2(1+q)^2} + \frac{(1-m_a)\sigma^2}{2}\cdot\frac{1 + q(1+q/2)}{(1+q)^2}$$
$$+\ \frac{1+m_a}{2}\Big[\mathrm{Cov}_0\{Y^2, \ln[2\cosh(bm^* + a + H)]\} + E_0\{H'\tanh(bm^* + a + H)\}\Big]$$
$$+\ \frac{1-m_a}{2}\left[\frac{1}{(1+q)^2}\mathrm{Cov}_1\{Y^2, \ln[2\cosh(bm^* + a + H)]\} + E_1\{H'\tanh(bm^* + a + H)\}\right] \qquad (56)$$

where $H'$ is defined by

$$H' \triangleq -\frac{\sigma^2}{2(1+q)} + \frac{q(2+q)Y^2}{2(1+q)^2} \qquad (57)$$

which is, up to a factor of $-2$, the derivative of (42) w.r.t. $\beta$ (with $y_i$ replaced by $Y$). To ease understanding of the MMSE, we evaluate its value in two extreme cases in Appendix D.



5.5.3 Discussion 

Returning now to the general expression of the MMSE, it is reasonable to expect that at the critical points, where $m^*$ jumps from one solution of eq. (47) to another as the parameters of the model vary, the MMSE may also undergo an abrupt change, and so the MMSE may be discontinuous (w.r.t. these parameters) at these points. A related abrupt change takes place also in the response of the MMSE estimator itself at the critical points: Note that $m^*$ is the dominant magnetization a-posteriori. Thus, as $m^*$ jumps, say, from $m^* = m_1$ to $m^* = m_2$, the conditional mean estimator, which is a weighted average of $\{x\}$, transfers most of the weight from a set of $x$-vectors whose binary support vectors $\{s\}$ correspond to magnetization $m_1$, into another set of $x$-vectors supported by $\{s\}$ with magnetization $m_2$. It is not surprising then that this abrupt change in the response of the estimator is accompanied by a corresponding sudden drop in the MMSE.

It is instructive to compare the type of the phase transition in our example to those of the ordinary Curie-Weiss model. In the Curie-Weiss model, we have:

• A first order phase transition w.r.t. the magnetic field (below the critical temperature), 
i.e., the first derivative of the free energy w.r.t. the magnetic field (which is exactly 
the magnetization) is discontinuous (at the point of zero field). 

• A second order phase transition w.r.t. temperature, i.e., the first derivative of the free 
energy w.r.t. temperature (which is related to the internal energy) is continuous, but 
the second derivative (which is related to the specific heat) is not. 

Here, on the other hand, in physics terms, what we observe is a first order phase transition 
w.r.t. temperature. The reason for this discrepancy is that in our model, the dependency 
of the free energy on temperature is introduced via the variables {hi} that play the role of 
magnetic fields. 

In the case of the quadratic exponent (53), $b = 0$ corresponds to the special case of i.i.d. $\{S_i\}$. In this case, our problem is analogous to a system of non-interacting particles, where of course, no phase transitions can exist. Therefore, what we learn from statistical physics here is that phase transitions in the MMSE estimator cannot be a property of the sparsity alone (because sparsity may be present also for the i.i.d. case with $P\{S_i = 1\}$ small), but rather a property of strong dependency between $\{S_i\}$, whether it comes with sparsity or not.
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Appendix A — Estimation of Sparse Signals: The Dominant 
Magnetization 

For the time being let us assume that $H_i$, $i = 1,\ldots,n$ take on values from a discrete set $\{h_1,\ldots,h_K\}$, where of the $n$ variables, $q_k n$ of them take the value $h_k$. The sum in (46) can be rewritten as

$$\sum_\mu\exp\left\{nf(m(\mu)) + \sum_{k=1}^K h_k\sum_{i=1}^{q_k n}\mu_{ki}\right\} \qquad (58)$$

where we relabel $\mu_i$ as $\mu_{ki}$ with $i = 1,\ldots,q_k n$ for each $k$. Up to the factor $2^n$, the sum in (58) is an expectation over $\mu$ drawn uniformly from the $\pm 1$ sequences, which can be viewed as an integral

$$\int\exp\left\{nf(m) + \sum_{k=1}^K h_k(q_k n)m_k\right\}N(dm_1,\ldots,dm_K) \qquad (59)$$

where $N$ is a probability measure proportional to the number of sequences with $\sum_{i=1}^{q_k n}\mu_{ki} = q_k n\,m_k$. Here $m = \sum_{k=1}^K q_k m_k$. For $\mu$ uniformly randomly chosen from $\pm 1$ sequences, the probability measure satisfies the large deviations property, the rate function (or entropy) of which is obtained as (using the Legendre-Fenchel transform*)

$$I(m_1,\ldots,m_K) = \sum_{k=1}^K q_k\left(\log 2 - \mathcal{H}_2\left(\frac{1+m_k}{2}\right)\right). \qquad (61)$$

Not surprisingly, the rate function achieves its minimum (equivalently, the entropy achieves its maximum) at $m_k = 0$, $k = 1,\ldots,K$, where the number of $\pm 1$'s in each subsequence $\mu_{ki}$, $i = 1,\ldots,q_k n$ is balanced. Due to the large deviations property, the integral (59) is dominated by unique values of $m_k$, $k = 1,\ldots,K$.

*By Cramér's theorem [11, Theorem II.4.1], the probability measure of the empirical mean $\frac{1}{n}\sum_i X_i$ of i.i.d. random variables $X_i$ satisfies, as $n \to \infty$, the large deviations property with some rate function $I(m)$. The rate function of the probability measure is given by the Legendre-Fenchel transform of the cumulant generating function (logarithm of the moment generating function) [4,11]:

$$I(m) = \sup_\eta\left[\eta m - \log E\,e^{\eta X_1}\right]. \qquad (60)$$

It is straightforward to generalize to the product measure of the means of subgroups of i.i.d. random variables.

Specifically, we use Varadhan's Theorem [4,11] to obtain**

$$\frac{1}{n}\log\int\cdots\int\exp\left\{nf(m) + \sum_{k=1}^K h_k(q_k n)m_k\right\}N(dm_1,\ldots,dm_K)$$
$$\to \sup_{m_1,\ldots,m_K\in[-1,1]}\left\{f(m) + \sum_{k=1}^K h_k q_k m_k - I(m_1,\ldots,m_K)\right\}$$
$$= -\log 2 + \sup_{m_1,\ldots,m_K\in[-1,1]}\psi(m_1,\ldots,m_K) \qquad (63)$$

where we use (61) and define

$$\psi(m_1,\ldots,m_K) \triangleq f\left(\sum_{k=1}^K q_k m_k\right) + \sum_{k=1}^K q_k\left[h_k m_k + \mathcal{H}_2\left(\frac{1+m_k}{2}\right)\right]. \qquad (64)$$

**Varadhan's Theorem basically states that, if the sequence of probability measures $N_n$ on $\mathbb{R}$ satisfies the large deviations property with rate function $I(m)$, and $F$ is continuous and upper bounded on $\mathbb{R}$, then

$$\lim_{n\to\infty}\frac{1}{n}\log\int\exp\{nF(m)\}N_n(dm) = \sup_m\{F(m) - I(m)\}. \qquad (62)$$

The result can also be generalized to multiple dimensions.

The maximum of $\psi$ is achieved by an internal point in $(-1,1)^K$. This is because $\mathcal{H}_2$ is concave with infinite derivative at the boundary $m_k = \pm 1$, whereas the derivative of $f$ is finite by assumption. Because the function $\psi$ is twice differentiable, at its maximum, the gradient of $\psi$ w.r.t. every $m_k$ should be equal to 0, whereas the Hessian of $\psi$ should be negative definite. It can be shown by taking the derivative of $\psi$ w.r.t. $m_k$ that zero gradient is achieved by setting

$$m_k = \tanh\left(f'\left(\sum_{l=1}^K q_l m_l\right) + h_k\right) \qquad (65)$$

for all $k$, so that

$$m = \sum_{k=1}^K q_k\tanh(f'(m) + h_k). \qquad (66)$$

The Hessian of $\psi$ is determined by noting that

$$\frac{\partial^2\psi}{\partial m_k\,\partial m_l} = q_k q_l f''(m) - \frac{q_k\,\delta_{k,l}}{1 - m_k^2} \qquad (67)$$

where $\delta_{k,l}$ is equal to 1 if $k = l$ and equal to 0 otherwise. The Hessian is negative definite if and only if

$$\left(\sum_{k=1}^K q_k x_k\right)^2 f''(m) < \sum_{k=1}^K\frac{q_k x_k^2}{1 - m_k^2} \qquad (68)$$

for all $x_k \in \mathbb{R}$, $k = 1,\ldots,K$, not all zero, which is equivalent to

$$f''(m) < \min_{x_1,\ldots,x_K}\frac{\sum_{k=1}^K q_k x_k^2/(1 - m_k^2)}{\left(\sum_{k=1}^K q_k x_k\right)^2}. \qquad (69)$$

Using a Lagrange multiplier, the minimum on the r.h.s. of (69) is obtained as $\left(1 - \sum_{k=1}^K q_k m_k^2\right)^{-1}$. Further, by (65), the condition (69) reduces to

$$f''(m) < \frac{1}{1 - \sum_{k=1}^K q_k\tanh^2(f'(m) + h_k)}. \qquad (70)$$

In other words, a solution of (65) is a local maximum of $\psi$ if and only if it also satisfies (70). If multiple such solutions exist, the global supremum is identified by comparing the corresponding values of $\psi$.

In the limit $n \to \infty$, the requirement that $H_i$ take discrete values is not necessary (the continuous distribution can be regarded as the limit of a degenerate discrete one). Using (66) and (70), the dominant magnetization $m^*$ satisfies (47) and (48) for a general distribution of $H$. This can be made precise by formulating a variational problem.

We also note an alternative technique for evaluating the free energy (46) using the Fourier transform and the saddle point method, which is standard in statistical mechanics (often without rigorous justification). Usage of this technique in information theory can be found in e.g., [23].



Appendix B — Estimation of Sparse Signals: An Alternative 
Derivation of 



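In the case of the quadratic exponent (53), the partition function (45) can be written using the Hubbard-Stratonovich transformation as

$$\sum_\mu P(\mu)\,e^{\sum_i\mu_i h_i} = C_n\sum_\mu\exp\left\{a\sum_i\mu_i + \sum_i\mu_i h_i + \frac{b}{2n}\left(\sum_i\mu_i\right)^2\right\}$$
$$= C_n\sqrt{\frac{nb}{2\pi}}\sum_\mu\int dm\,\exp\left\{-\frac{nbm^2}{2}\right\}\exp\left\{a\sum_i\mu_i + \sum_i\mu_i h_i + bm\sum_i\mu_i\right\}$$
$$= C_n\sqrt{\frac{nb}{2\pi}}\int dm\,\exp\left\{-\frac{nbm^2}{2}\right\}\prod_{i=1}^n[2\cosh(a + bm + h_i)]$$
$$= C_n\sqrt{\frac{nb}{2\pi}}\int dm\,\exp\left\{n\left[-\frac{bm^2}{2} + \frac{1}{n}\sum_{i=1}^n\ln[2\cosh(a + bm + h_i)]\right]\right\}. \qquad (71)$$

Thus, we have $-\ln Z \sim n\min_m\psi(m) - \ln C_n$, where $\psi$ is defined by (50), whose minimum is attained at $m^* = m^*(\beta)$, one of the solutions to the equation $m = E\{\tanh(bm + a + H)\}$, as before.* The mutual information is then obtained as (55).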



Appendix C — Estimation of Sparse Signals: The MMSE 

The MMSE is equal to twice the derivative of (55) w.r.t. $\beta$. We will denote hereafter $H_i$ as given by (42) with $y_i$ replaced by $Y_i$ and $H = (H_1,\ldots,H_n)$. Let us present the asymptotic MMSE per sample, $\lim_{n\to\infty}\mathrm{mmse}(X|Y)/n$, as $A + B$, where $A$ is twice the derivative of the first three terms, and $B$ is the contribution of the other terms. The easy part is the former:

$$A = \frac{\sigma^2 q}{2(1+q)^2} + \frac{(1-m_a)\sigma^2}{2}\cdot\frac{1 + q(1+q/2)}{(1+q)^2}.$$

*The function $\psi(m)$ is (within a factor of the inverse temperature) identified with the Landau free energy function for this problem [22, p. 186, eq. (4.15a)], [14, Sect. 4.6].

As for $B$, we have the following consideration: The first three terms depend only on $m_a$, which in turn is independent of $\beta$, therefore their derivatives w.r.t. $\beta$ all vanish. For the last two terms, pertaining to $\psi(m^*)$, it proves useful to return to the original expression of the Gaussian integral (71), i.e.,

$$B = -\frac{2}{n}\frac{\partial}{\partial\beta}E\{\ln Z(\beta|H)\} = -\frac{2}{n}\frac{\partial}{\partial\beta}\int_{\mathbb{R}^n}dy\,P_\beta(y)\ln\int dm\,\exp\left\{n\left[-\frac{bm^2}{2} + \frac{1}{n}\sum_{i=1}^n\ln[2\cosh(bm + a + h_i)]\right]\right\}$$
$$= -\frac{2}{n}\int_{\mathbb{R}^n}dy\,\frac{\partial P_\beta(y)}{\partial\beta}\ln\int dm\,\exp\left\{n\left[-\frac{bm^2}{2} + \frac{1}{n}\sum_{i=1}^n\ln[2\cosh(bm + a + h_i)]\right]\right\}$$
$$\quad -\frac{2}{n}\int_{\mathbb{R}^n}dy\,P_\beta(y)\frac{\partial}{\partial\beta}\ln\int dm\,\exp\left\{n\left[-\frac{bm^2}{2} + \frac{1}{n}\sum_{i=1}^n\ln[2\cosh(bm + a + h_i)]\right]\right\}$$
$$\triangleq B_1 + B_2. \qquad (72)$$

Now, $P_\beta(y)$ is the mixture of Gaussians weighted by $\{P(\mu)\}$, where the dominant $\mu$-configurations are those with $n(1+m_a)/2$ $(+1)$'s and $n(1-m_a)/2$ $(-1)$'s. Each such configuration contributes the same quantity to $B_1$ and $B_2$, because for every given such $\mu$, the random variables $\{Y_i\}$ (and hence also $\{H_i\}$) are all independent, a fraction $(1+m_a)/2$ of them are $\mathcal{N}(0, 1/\beta)$ and the remaining fraction of $(1-m_a)/2$ are $\mathcal{N}(0, \sigma^2 + 1/\beta)$. Thus, it is sufficient to confine attention to one such sequence, call it $\mu^*$, whose first $n_1 = n(1-m_a)/2$ components are all $-1$ and last $n - n_1 = n(1+m_a)/2$ components are all $+1$. Thus,

$$B_1 = \frac{1}{n}\mathrm{Cov}\left\{\frac{1}{(1+q)^2}\sum_{i=1}^{n_1}Y_i^2 + \sum_{i=n_1+1}^{n}Y_i^2,\ \sum_{j=1}^n\ln[2\cosh(bm^* + a + H_j)]\right\}$$
$$= \frac{1+m_a}{2}\,\mathrm{Cov}_0\{Y^2, \ln[2\cosh(bm^* + a + H)]\} + \frac{1-m_a}{2(1+q)^2}\,\mathrm{Cov}_1\{Y^2, \ln[2\cosh(bm^* + a + H)]\}, \qquad (73)$$

where $\mathrm{Cov}_s\{\cdot,\cdot\}$ denotes covariance with respect to $\mathcal{N}(0, \sigma^2 s + 1/\beta)$, $s = 0,1$. Finally, for $B_2$, we have:

$$B_2 = -\frac{2}{n}\int_{\mathbb{R}^n}dy\,P_\beta(y)\frac{\partial}{\partial\beta}\ln\int dm\,\exp\left\{n\left[-\frac{bm^2}{2} + \frac{1}{n}\sum_{i=1}^n\ln[2\cosh(bm + a + h_i)]\right]\right\}$$
$$= E\left\{\frac{1}{n}\sum_{i=1}^n H_i'\tanh(bm^* + a + H_i)\right\}$$
$$= \frac{1+m_a}{2}E_0\{H'\tanh(bm^* + a + H)\} + \frac{1-m_a}{2}E_1\{H'\tanh(bm^* + a + H)\}, \qquad (74)$$

where $E_s$ denotes expectation w.r.t. $\mathcal{N}(0, \sigma^2 s + 1/\beta)$, $s = 0,1$, and $H'$ is given by (57), and correspondingly, $h_i'$ and $H_i'$ are given by the same formula with $Y$ replaced by $y_i$ and $Y_i$ respectively. Collecting all terms, $A$, $B_1$, and $B_2$, we have (56).



Appendix D — Estimation of Sparse Signals: Two Extreme 
Cases 

Two extreme cases, where it is relatively easy to examine the resulting expression, are as follows:

• When $b \gg 1$ and $a \ll -1$, we have $m_a \approx -1$ and $m^* \approx -1$ (which means that most $S_i = 1$), and so we can approximate

$$\ln[2\cosh(bm^* + a + H)] \approx \ln[2\cosh(-b + a + H)] \approx b - a - H$$

and $\tanh(bm^* + a + H) \approx -1$, and we get

$$\lim_{n\to\infty}\frac{\mathrm{MMSE}(X|Y)}{n} \approx \frac{\sigma^2}{1+q},$$

the classical Wiener expression, as expected.*

• When $b \gg 1$ and $a \gg 1$, we have $m_a \approx 1$ and $m^* \approx 1$ (which means that most $S_i = 0$), and then $\ln[2\cosh(bm^* + a + H)] \approx b + a + H$ and $\tanh(bm^* + a + H) \approx 1$, so we get

$$\lim_{n\to\infty}\frac{\mathrm{MMSE}(X|Y)}{n} \approx \frac{1-m_a}{2}\cdot\sigma^2,$$

which means the conditional-mean estimator simply outputs essentially the all-zero sequence without attempting to detect (explicitly or implicitly) which of the few signal components are active. The intuition behind this behavior is that when there are so few active components of the clean signal, then even if there are nevertheless a few observations $\{y_i\}$ with large absolute values (and hence could have been suspected to stem from places where $S_i = 1$), it is still more plausible for the estimator to "assume" that they simply belong to the tail of $\mathcal{N}(0, 1/\beta)$ (with $S_i = 0$) rather than to $\mathcal{N}(0, \sigma^2 + 1/\beta)$ with $S_i = 1$. This is because the prior for $S_i = 1$ is so small that it becomes comparable to the tail probability of $\mathcal{N}(0, 1/\beta)$.

*Here, by $\lim_{n\to\infty}\mathrm{MMSE}(X|Y)/n \approx F(a,b,\beta,\sigma^2)$, for a generic function $F$, we mean that $\lim_{a\to-\infty}\lim_{b\to\infty}\lim_{n\to\infty}nF(a,b,\beta,\sigma^2)/\mathrm{MMSE}(X|Y) = 1$. A similar comment applies to item number 2.